FIX: increase chunk size to fetch title tag correctly (PR #14144)

This commit increases the HTML document chunk size so that the title can be retrieved. It also checks the extracted title for </t to detect a title tag that was cut off before the next chunk was retrieved.

this feels a bit odd, can you expand on this change? why is it needed? can we add a test?

This is needed because of this condition:

Because Nokogiri repairs broken HTML tags, a title is extracted from the truncated chunk and the additional HTML chunks are never received, which results in a broken title (one with </titl in the title text). We need to confirm that the extracted title did not come from a broken tag, hence I’m checking for the presence of </t in the title text.
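To make the failure mode concrete, here is a minimal sketch in plain Ruby. It is not the code from the PR: `lenient_title`, `truncated_title?`, and `extract_title` are hypothetical names, and `lenient_title` is only a rough stand-in for a forgiving parser such as Nokogiri, which is assumed here to return whatever text follows an unclosed title tag.

```ruby
# Hypothetical stand-in for a lenient HTML parser: returns the text after
# <title>, up to </title> if present, otherwise up to the end of the buffer.
def lenient_title(html)
  start = html.index("<title>")
  return nil unless start

  rest = html[(start + "<title>".length)..-1]
  closing = rest.index("</title>")
  closing ? rest[0...closing] : rest
end

# A title that still contains the start of a closing tag was cut mid-tag.
def truncated_title?(title)
  title.include?("</t")
end

# Append chunks one at a time and stop at the first title that looks complete.
def extract_title(chunks)
  buffer = +""
  chunks.each do |chunk|
    buffer << chunk
    title = lenient_title(buffer)
    # Keep reading when there is no title yet or the title looks truncated.
    next if title.nil? || title.empty? || truncated_title?(title)
    return title
  end
  nil
end

# The chunk boundary falls inside the closing tag, as in the reported case.
chunks = ["<html><head><title>Hello World</titl", "e></head></html>"]
puts extract_title(chunks)
```

Without the `truncated_title?` guard, the loop would stop after the first chunk and return the title with the tag fragment attached. The guard only covers a cut that lands inside the closing tag after `</t`; a cut in the middle of the title text itself is not detected by this check.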

Adding a test here is tricky because, to test this case, we need to stream chunks of HTML from a URL and break the chunk just where the title tag appears.

I added a test for this change.

no need for 2 regexes here :slight_smile:

/<title>.*<\/title>/

The above regex will cause an issue when the title tag is missing and we need to fall back on og:title to extract the title.
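A small sketch of that objection, in plain Ruby. The helper names `complete_title_pair?` and `og_title` are hypothetical, and the og:title regex is a simplification that assumes the `property` attribute comes before `content`; the real code may extract it differently.

```ruby
# Hypothetical guard built on the suggested single regex: true only when a
# complete <title>...</title> pair is present in the HTML received so far.
def complete_title_pair?(html)
  html.match?(/<title>.*<\/title>/)
end

# Hypothetical fallback: read og:title from a meta tag with a simple regex.
def og_title(html)
  match = html.match(/<meta[^>]*property="og:title"[^>]*content="([^"]*)"/)
  match && match[1]
end

with_title = "<html><head><title>Example</title></head></html>"
og_only = '<html><head><meta property="og:title" content="Example"></head></html>'

puts complete_title_pair?(with_title) # true
puts complete_title_pair?(og_only)    # false
puts og_title(og_only)                # Example
```

If title extraction were gated on `complete_title_pair?`, a page that only provides og:title would never pass the gate, even though a usable title is available from the meta tag.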