FIX: Inject extra lexemes for host lexeme. (PR #10198)

The title of this pull request changed from “FIX: Search for whole URLs wasn’t working.” to “FIX: Inject extra lexemes for host lexeme.”

tsvector = DB.query_single("SELECT #{ranked_index}", ranked_params)[0]

My big question though is, is this injection still too much?

If my post contains sam.i.am.hello, should the word hello really find this post? Should this injection be specific to domains, e.g. https://www.discourse.org? Even if this is for domains, should org be included as well? What about query params? Surely we don’t want to inject in query params, e.g. https://domain.com?a.b.c.g.e.f.h=1. Do we really need h here as a distinct piece?

@SamSaffron IMO the searches which I’ve added above are all valid and make search better.

Realized we need a safeguard here to prevent extreme cases, such as 'some.super.long.file.name.that.will.never.end.some.super.long.file.name.that.will.never.end.some.super.long.file.name.that.will.never.end.some.super.long.file.name.that.will.never.end'.
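To make the safeguard idea concrete, here is a minimal sketch of splitting a dotted lexeme into extra lexemes with an upper bound. It is not the code from this PR: the helper name `extra_lexemes`, the constant `MAX_EXTRA_LEXEMES` and the limit of 10 are all assumptions made for illustration.

```ruby
# Hypothetical helper, not taken from the PR.
# Assumed cap on how many extra lexemes a single dotted lexeme may produce.
MAX_EXTRA_LEXEMES = 10

# Splits a dotted lexeme such as "www.discourse.org" into its parts so each
# part can be indexed on its own, and caps the number of parts returned.
def extra_lexemes(lexeme, max: MAX_EXTRA_LEXEMES)
  parts = lexeme.split(".").reject(&:empty?)
  # A lexeme without dots has nothing extra to inject.
  return [] if parts.size < 2

  # Drop repeated parts, then keep at most `max` of them.
  parts.uniq.first(max)
end

p extra_lexemes("www.discourse.org")
p extra_lexemes("sam.i.am.hello")
```

Removing repeated parts before applying the cap keeps a pathological file name like the one above from crowding out distinct parts. Whether the cap should count parts or characters is left open here.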

discourse_development=# SELECT * FROM TS_DEBUG('https://domain.com?a.b.c.g.e.f.h=1');
  alias   |    description    |     token     | dictionaries | dictionary |     lexemes     
----------+-------------------+---------------+--------------+------------+-----------------
 protocol | Protocol head     | https://      | {}           |            | 
 host     | Host              | domain.com    | {simple}     | simple     | {domain.com}
 blank    | Space symbols     | ?             | {}           |            | 
 file     | File or path name | a.b.c.g.e.f.h | {simple}     | simple     | {a.b.c.g.e.f.h}
 blank    | Space symbols     | =             | {}           |            | 
 uint     | Unsigned integer  | 1             | {simple}     | simple     | {1}

Ahh I thought PG was smart enough to only retrieve the host… looks like it treats query params as file here. I wonder if we can identify what type a lexeme is.

If my post contains sam.i.am.hello should the word hello really find this post?

I think so since this was one of the bug reports we got previously https://meta.discourse.org/t/discourses-internal-search-does-not-find-the-phrase-pagedowncustom-but-google-does/35406/8.

what about query params… surely we don’t want to inject in query params eg: https://domain.com?a.b.c.g.e.f.h=1 do we really need h here as a distinct piece?

Query params are tricky and it seems like the PG default parser isn’t very smart.

discourse_development=# SELECT TO_TSVECTOR('https://meta.discourse.org?test.dot=1');
                to_tsvector                
-------------------------------------------
 '1':3 'meta.discourse.org':1 'test.dot':2
(1 row)

discourse_development=# SELECT TO_TSVECTOR('https://meta.discourse.org/?test.dot=1');
                                to_tsvector                                 
----------------------------------------------------------------------------
 '/?test.dot=1':3 'meta.discourse.org':2 'meta.discourse.org/?test.dot=1':1
(1 row)

I guess one thing we can do is to drop query params from the search index?

@SamSaffron I think we should just strip the query string from URLs. The following seems to work as expected:

discourse_development=# SELECT TO_TSVECTOR('https://www.discourse.org?test=2&test2=3');
                     to_tsvector
------------------------------------------------------
 '2':3 '3':5 'test':2 'test2':4 'www.discourse.org':1

However, once a path is present, the whole query string ends up inside the URL path lexeme:

discourse_development=# SELECT TO_TSVECTOR('https://www.discourse.org/latest?test=2&test2=3');
                                         to_tsvector
----------------------------------------------------------------------------------------------
 '/latest?test=2&test2=3':3 'www.discourse.org':2 'www.discourse.org/latest?test=2&test2=3':1

The parsing here is really inconsistent, and complex query strings will end up generating noise in our search data.
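A rough sketch of stripping query strings before the text reaches `TO_TSVECTOR` could look like the following. The method name `strip_query_strings` and the regular expression are assumptions for illustration, not what the PR implements. It only handles http and https URLs and leaves any fragment in place.

```ruby
# Hypothetical pre-processing step, not taken from the PR.
# Captures an http(s) URL up to (but excluding) "?" and then matches the
# query string, stopping at whitespace or a "#" fragment.
URL_WITH_QUERY = %r{(https?://[^\s?#]+)\?[^\s#]*}

# Removes the query string from every URL found in the text, so that only
# the host and path are handed to the search indexer.
def strip_query_strings(text)
  text.gsub(URL_WITH_QUERY, '\1')
end

puts strip_query_strings("https://www.discourse.org/latest?test=2&test2=3")
puts strip_query_strings("https://domain.com?a.b.c.g.e.f.h=1")
```

With the query string gone, both the path and no-path forms reduce to input the parser already handles consistently, at the cost of no longer being able to search for query parameter values.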

I still feel we should refine this further, but I don’t want to block the further optimisation of the excerpt parsing, can you go ahead and merge?