From: | Sam Saffron <sam(dot)saffron(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Contractions in full text search result in very surprising stemming |
Date: | 2023-01-31 06:27:40 |
Message-ID: | CAAtdryOnYDJz7C8PLmYxGj8GGU=CTTVsxRF+5ys7XZWcTkHp=Q@mail.gmail.com |
Lists: | pgsql-hackers |
Per:
```
select ts_debug('english', 'you''re a star');
ts_debug
-----------------------------------------------------------------------
(asciiword,"Word, all ASCII",you,{english_stem},english_stem,{})
(blank,"Space symbols",',{},,)
(asciiword,"Word, all ASCII",re,{english_stem},english_stem,{re})
(blank,"Space symbols"," ",{},,)
(asciiword,"Word, all ASCII",a,{english_stem},english_stem,{})
(blank,"Space symbols"," ",{},,)
(asciiword,"Word, all ASCII",star,{english_stem},english_stem,{star})
(7 rows)
```
And:
https://snowballstem.org/demo.html
https://snowballstem.org/texts/apostrophe.html
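The Snowball stemmer has special handling for contractions built in,
but out of the box, due to the order of filters, it never gets access
to the data: the parser splits `you're` at the apostrophe before
english_stem sees it.
That means that a word such as `you're` is reduced to the lexeme `re`
(`you` yields no lexeme at all, as the output above shows). Prefix
matches then end up hitting lots of surprising words.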
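To make the prefix problem concrete, here is a minimal sketch, assuming
the stock `english` configuration with no custom dictionaries. The
sample sentences and the literal `re:*` query are illustrative choices
of mine, standing in for whatever a search front end generates for a
prefix search on the contraction:
```sql
-- Show what actually gets indexed for the contraction.
select to_tsvector('english', 'you''re a star');

-- A prefix query on the leftover lexeme matches the contraction ...
select to_tsvector('english', 'you''re a star')
       @@ to_tsquery('english', 're:*') as matches_contraction;

-- ... and just as happily matches any unrelated word starting with "re".
select to_tsvector('english', 'please reply soon')
       @@ to_tsquery('english', 're:*') as matches_unrelated;
```
I would expect both comparisons to come back true, which is the
surprising part: a search that started from `you're` ends up matching
documents that only contain words like `reply`.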
I know this is a big can of worms and unlikely to be easy to resolve.
The latest changes to `to_tsquery` (replacing & with <->) are already
a bitter enough pill for many to swallow, and another breaking change
is not something many desire. However, it feels like an oversight (at
least documentation-wise). Perhaps a good starting point might be to
clearly document the issue and a workaround?