Re: fixing tsearch locale support

From: "Daniel Verite" <daniel(at)manitou-mail(dot)org>
To: "Peter Eisentraut" <peter(at)eisentraut(dot)org>
Cc: "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fixing tsearch locale support
Date: 2025-08-18 15:56:01
Message-ID: 15e97660-9e3c-43a2-8cad-7b33fc7f7476@manitou-mail.org
Lists: pgsql-hackers

Peter Eisentraut wrote:

> There is a PG18 open item to document this possible upgrade incompatibility.
>
> I think the following text could be added to the release notes:
>
> """
> The locale implementation underlying full-text search was improved. It
> now observes the locale provider configured for the database. It was
> previously hardcoded to use the configured libc LC_CTYPE setting
> [...]

That sounds misleading because LC_CTYPE is still used in 18.

To illustrate: in an ICU database, the parser classifies the em dash
(U+2014) as a separator or not depending on LC_CTYPE.

with LC_CTYPE=C:

=> select alias, token,lexemes from ts_debug('simple', U&'ABCD\2014EFGH');
 alias |   token   |   lexemes
-------+-----------+-------------
 word  | ABCD—EFGH | {abcd—efgh}

with LC_CTYPE=en_US.utf8 (glibc 2.35):

=> select alias, token,lexemes from ts_debug('simple', U&'ABCD\2014EFGH');
   alias   | token | lexemes
-----------+-------+---------
 asciiword | ABCD  | {abcd}
 blank     | —     |
 asciiword | EFGH  | {efgh}
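
For anyone wanting to reproduce the two cases above, here is a minimal
sketch of how such databases could be created. The database names, the
ICU locale 'en', and the availability of en_US.utf8 on the host are
assumptions for illustration, not details taken from this thread.

```sql
-- Hypothetical database names; both use the ICU provider and differ
-- only in their libc LC_COLLATE/LC_CTYPE settings.
CREATE DATABASE icu_ctype_c
  TEMPLATE template0 ENCODING 'UTF8'
  LOCALE_PROVIDER icu ICU_LOCALE 'en'
  LC_COLLATE 'C' LC_CTYPE 'C';

-- Assumes the operating system provides the en_US.utf8 locale.
CREATE DATABASE icu_ctype_en
  TEMPLATE template0 ENCODING 'UTF8'
  LOCALE_PROVIDER icu ICU_LOCALE 'en'
  LC_COLLATE 'en_US.utf8' LC_CTYPE 'en_US.utf8';
```

Running the ts_debug() query in each database should then show whether
the em dash splits the token.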

OTOH, lowercasing uses LC_CTYPE in 17 but the database's locale
provider in 18, which leads to better lexemes.

pg17, ICU locale, LC_CTYPE=C:

=> select alias, token,lexemes from ts_debug('simple', 'ÉTÉ');
 alias | token | lexemes
-------+-------+---------
 word  | ÉTÉ   | {ÉtÉ}

pg18, ICU locale, LC_CTYPE=C:

=> select alias, token,lexemes from ts_debug('simple', 'ÉTÉ');
 alias | token | lexemes
-------+-------+---------
 word  | ÉTÉ   | {été}
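
To tell which of these cases applies to a given database, the provider
and the libc LC_CTYPE can be read from the pg_database catalog. This is
only a sketch; it assumes version 15 or later, where the datlocprovider
column exists.

```sql
-- datlocprovider: 'c' = libc, 'i' = icu ('b' = builtin in 17 and later).
-- datctype is the libc LC_CTYPE setting of the database.
SELECT datname, datlocprovider, datcollate, datctype
FROM pg_database
WHERE datname = current_database();
```

A row with datlocprovider = 'i' and datctype = 'C' corresponds to the
"ICU locale, LC_CTYPE=C" examples shown here.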

So maybe the release notes should say
"now observes the locale provider configured for the database to
convert strings to lower case".

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
