Contractions in full text search result in very surprising stemming

From: Sam Saffron <sam(dot)saffron(at)gmail(dot)com>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Contractions in full text search result in very surprising stemming
Date: 2023-01-31 06:27:40
Message-ID: CAAtdryOnYDJz7C8PLmYxGj8GGU=CTTVsxRF+5ys7XZWcTkHp=Q@mail.gmail.com
Lists: pgsql-hackers

Per:

```
select ts_debug('english', 'you''re a star');
ts_debug
-----------------------------------------------------------------------
(asciiword,"Word, all ASCII",you,{english_stem},english_stem,{})
(blank,"Space symbols",',{},,)
(asciiword,"Word, all ASCII",re,{english_stem},english_stem,{re})
(blank,"Space symbols"," ",{},,)
(asciiword,"Word, all ASCII",a,{english_stem},english_stem,{})
(blank,"Space symbols"," ",{},,)
(asciiword,"Word, all ASCII",star,{english_stem},english_stem,{star})
(7 rows)
```

And:

https://snowballstem.org/demo.html
https://snowballstem.org/texts/apostrophe.html

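The Snowball stemmer has special handling for contractions built in,
but out of the box, due to the order of filters, it never gets access
to the data: as the output above shows, the parser has already split
the text at the apostrophe before the stemmer runs.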

That means that a word such as `you're` is reduced to just `re`
(`you` is dropped as a stop word, leaving `re` as the only lexeme).
Prefix matches on `re` then end up hitting lots of surprising words.
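
To make the prefix-match problem concrete, here is a small sketch. It
assumes an application that turns the user's input into a prefix query
on the surviving token; the sample document text is made up for
illustration.

```sql
-- Hypothetical scenario: the user typed "you're", and after parsing only
-- the lexeme "re" is left, so the effective prefix query is 're':*.
select to_tsquery('english', 're:*');

-- That prefix query matches any lexeme starting with "re", so unrelated
-- text qualifies. The sample sentence below is invented for illustration.
select to_tsvector('english', 'the red car needs a rest')
       @@ to_tsquery('english', 're:*') as matches_unrelated_text;
```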

I know this is a big can of worms... and unlikely to be easy to
resolve... the latest changes to `to_tsquery` (replacing & with <->)
are already a bitter enough pill for lots to swallow, and another
breaking change is not something many desire. However, it feels like
an oversight (at least documentation-wise). Perhaps a good starting
point might be to clearly document the issue and a workaround?
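
As one possible workaround (an assumption on my part, not something
the docs currently describe), the apostrophe can be stripped before
the text reaches the parser, so the contraction stays a single token.
This does not enable Snowball's own contraction handling; it only
avoids the split. It is a rough sketch and only covers the ASCII
apostrophe, not typographic quotes.

```sql
-- Assumption: dropping apostrophes before parsing is acceptable for the data.
-- With the apostrophe removed, the parser sees "youre" as one asciiword
-- token instead of splitting it into "you" and "re".
select ts_debug('english', replace('you''re a star', '''', ''));

-- The same preprocessing has to be applied to both the document and the
-- query, otherwise the two sides will not produce the same lexemes.
select to_tsvector('english', replace('you''re a star', '''', ''))
       @@ to_tsquery('english', replace('you''re', '''', '')) as matches;
```

The trade-off is that possessives such as `star's` are also joined
into a single token, so this would need checking against real data
before being relied on.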
