FIX: Reduce input of to_tsvector to follow limits (PR #13806)

Long posts may have cooked fields that produce tsvectors longer than the maximum size of 1MiB (1,048,576 bytes). This commit uses just the first million characters of the scrubbed cooked text for indexing.

Truncating to exactly 1 MiB (1_048_576) would not be enough, because the output tsvector can sometimes be longer than its input; stopping at one million characters gives us some breathing room.
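
As a rough sketch of the idea (the constant and helper name below are illustrative, not the actual code from the PR):

```ruby
# Illustrative only: MAX_CHARS and indexable_text are hypothetical names,
# not the identifiers used in the Discourse codebase.
MAX_CHARS = 1_000_000 # comfortably under the 1 MiB tsvector limit

def indexable_text(scrubbed_cooked)
  # Truncate the HTML-scrubbed cooked text before it is handed to
  # PostgreSQL's to_tsvector, so the resulting tsvector stays within bounds.
  scrubbed_cooked[0, MAX_CHARS]
end
```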

Shouldn’t we be doing this on all the weights, somewhere around here instead?

I looked at all update_index calls and none of the other ones pass strings that can be very long. If you want, I can do it there.

Can we truncate to SiteSetting.max_post_length instead? We don’t need to ensure search is accurate for posts that have somehow gone way over the limit.

But we could try to make it reliable for posts that are just under the limit. max_post_length applies to the raw body, and the cooked HTML tends to be bigger, so trimming to that length could drop valid content from the end of the post.

@CvX You are right, that is exactly why I did not use SiteSetting.max_post_length. Even a simple _test_ will be expanded to something like <p><i>test</i></p>, a 3x increase in length. Oneboxes can increase that length even more.
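
As a back-of-the-envelope illustration of that expansion (plain string lengths only, not the actual cooking pipeline):

```ruby
raw    = "_test_"             # 6 characters of raw Markdown
cooked = "<p><i>test</i></p>" # 18 characters once cooked to HTML

cooked.length / raw.length.to_f # => 3.0, the roughly 3x growth mentioned above
```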

I’m not following here: the HTML elements are scrubbed before we attempt to index the data, so the HTML tags should not matter in this case. My main objection is that I don’t think we should carry code just to handle posts that have somehow bypassed SiteSetting.max_post_length. Do we know what kind of content in a post that is below SiteSetting.max_post_length can end up with scrubbed cooked text that bursts the tsvector size limit?
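
For context, stripping the markup before indexing (sketched here with Nokogiri; the actual scrubbing code in Discourse may differ) means the tags themselves never reach to_tsvector:

```ruby
require "nokogiri"

cooked   = "<p><i>test</i></p>"
scrubbed = Nokogiri::HTML.fragment(cooked).text

scrubbed        # => "test"
scrubbed.length # => 4, only the visible text counts toward the tsvector input
```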

My main objection is that I don’t think we should carry code just to handle posts that have somehow bypassed SiteSetting.max_post_length.

I can understand that, and we’re doing our best to avoid creating excessively long posts, but some might still sneak by, so it makes sense to (1) not throw a runtime error and (2) index as much of the post as we can.

I’m interested in the type of posts that are causing this because we have not encountered it before, so I think we need to dig a little deeper to see if we have a bug that is allowing posts that are too long. We already do not throw a runtime error when we fail to index a post, but we do log a warning about the exception.

I’m interested in the type of posts that are causing this

This fix comes from an error that was triggered by a PM written by the system user containing the full log of a backup operation. Since we skip validations in those cases, it generated a humongous post. We are in the process of including the backup log as an attachment instead of a block quote.

I can also see this happening on imported posts. We usually remove most restrictions because we want to move 100% of the content.

Overall, I think this fix is fine.