From: | "Daniel Verite" <daniel(at)manitou-mail(dot)org> |
---|---|
To: | "Peter Eisentraut" <peter(at)eisentraut(dot)org> |
Cc: | "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: fixing tsearch locale support |
Date: | 2025-08-18 15:56:01 |
Message-ID: | 15e97660-9e3c-43a2-8cad-7b33fc7f7476@manitou-mail.org |
Lists: | pgsql-hackers |
Peter Eisentraut wrote:
> There is a PG18 open item to document this possible upgrade incompatibility.
>
> I think the following text could be added to the release notes:
>
> """
> The locale implementation underlying full-text search was improved. It
> now observes the locale provider configured for the database. It was
> previously hardcoded to use the configured libc LC_CTYPE setting
> [...]
That sounds misleading because LC_CTYPE is still used in 18.
To illustrate: in an ICU database, the parser classifies "Em Dash"
as a separator or not, depending on LC_CTYPE.
with LC_CTYPE=C:
=> select alias, token,lexemes from ts_debug('simple', U&'ABCD\2014EFGH');
 alias |   token   |   lexemes
-------+-----------+-------------
 word  | ABCD—EFGH | {abcd—efgh}
with LC_CTYPE=en_US.utf8 (glibc 2.35):
=> select alias, token,lexemes from ts_debug('simple', U&'ABCD\2014EFGH');
   alias   | token | lexemes
-----------+-------+---------
 asciiword | ABCD  | {abcd}
 blank     | —     |
 asciiword | EFGH  | {efgh}
OTOH, lower casing uses LC_CTYPE in 17 but not in 18, which leads
to better lexemes.
pg17, ICU locale, LC_CTYPE=C:
=> select alias, token,lexemes from ts_debug('simple', 'ÉTÉ');
 alias | token | lexemes
-------+-------+---------
 word  | ÉTÉ   | {ÉtÉ}
pg18, ICU locale, LC_CTYPE=C:
=> select alias, token,lexemes from ts_debug('simple', 'ÉTÉ');
 alias | token | lexemes
-------+-------+---------
 word  | ÉTÉ   | {été}
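For anyone reproducing these comparisons, here is a minimal sketch (not part of the tests above) of how such a database could be created and how to confirm its locale provider and LC_CTYPE. The database name tsearch_icu_c and the ICU locale 'en' are placeholders chosen for illustration, and the catalog columns assume PostgreSQL 15 or later.

```sql
-- Hypothetical test database: ICU provider with the libc LC_CTYPE forced to C.
-- The name and the ICU locale 'en' are illustrative assumptions.
CREATE DATABASE tsearch_icu_c
    TEMPLATE template0
    ENCODING 'UTF8'
    LOCALE_PROVIDER icu
    ICU_LOCALE 'en'
    LC_COLLATE 'C'
    LC_CTYPE 'C';

-- After connecting to it, check which provider and LC_CTYPE are in effect.
-- datlocprovider is 'i' for ICU and 'c' for libc.
SELECT datname, datlocprovider, datcollate, datctype
FROM pg_database
WHERE datname = current_database();
```

Swapping LC_CTYPE 'C' for 'en_US.utf8' in the CREATE DATABASE statement should give the second configuration shown earlier, provided that locale is installed on the system.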
So maybe the release notes should say
"now observes the locale provider configured for the database to
convert strings to lower case".
Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/