FIX: Match on any segments after search term is processed by cppjieba. (PR #14677)

This commit only affects Chinese and Japanese, where search terms are processed by cppjieba prior to searching.

The term 白名单 ("whitelist") becomes 名单 白名单 after it is processed by cppjieba. However, 白名单 is not tokenized that way by cppjieba when it appears within a longer string of text. The workaround taken here is to match on either 名单 or 白名单 when terms are processed by cppjieba.

The change here will result in partial matches of terms, making search slightly less accurate. For example, 社區指南 ("community guidelines") becomes 社區 指南 when processed by cppjieba, and with this change we will match on either 社區 or 指南 instead of requiring both terms. This is a conscious trade-off: we think a poor result is better than no result for the Chinese and Japanese languages. To properly support search for these languages, we may look into integrating the PGroonga extension into Discourse in the future.
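A minimal sketch of the idea, using a hypothetical helper rather than the actual Discourse code: after a CJK search term has been segmented by cppjieba, the segments are joined with `|` (match any) instead of `&` (match all) when building the PostgreSQL tsquery string.

```ruby
# Hypothetical helper illustrating the workaround described above.
# Requiring every segment ("名单 & 白名单") can miss documents whose
# indexed text was tokenized differently, so we match on any segment.
def ts_query_for_segments(segments)
  segments.map { |segment| "'#{segment}':*" }.join(" | ")
end

puts ts_query_for_segments(["名单", "白名单"])
# => "'名单':* | '白名单':*"
```

The resulting string can be fed to PostgreSQL's `to_tsquery`, where `|` makes any single segment sufficient for a match.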

@udan11 Would you be able to review this for me? Thanks!

Sorry, but I am closing this. I am questioning whether we should even be using the query tokenizer anywhere. I think we should always use the mix tokenizer; at least that way the behavior is consistent.