Re: fixing tsearch locale support

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Daniel Verite <daniel(at)manitou-mail(dot)org>, Peter Eisentraut <peter(at)eisentraut(dot)org>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fixing tsearch locale support
Date: 2025-08-26 16:52:11
Message-ID: 98d2c87a-307e-43fd-b2a4-eb22e45aa9ec@iki.fi
Lists: pgsql-hackers

On 18/08/2025 18:56, Daniel Verite wrote:
>> There is a PG18 open item to document this possible upgrade incompatibility.
>>
>> I think the following text could be added to the release notes:
>>
>> """
>> The locale implementation underlying full-text search was improved. It
>> now observes the locale provider configured for the database. It was
>> previously hardcoded to use the configured libc LC_CTYPE setting
>> [...]
>
> That sounds misleading because LC_CTYPE is still used in 18.
>
> To illustrate in an ICU database, the parser will classify "Em Dash"
> as a separator or not depending on LC_CTYPE.
>
> with LC_CTYPE=C
>
> => select alias, token,lexemes from ts_debug('simple', U&'ABCD\2014EFGH');
>  alias |   token   |   lexemes
> -------+-----------+-------------
>  word  | ABCD—EFGH | {abcd—efgh}
>
>
> with LC_CTYPE=en_US.utf8 (glibc 2.35):
>
> => select alias, token,lexemes from ts_debug('simple', U&'ABCD\2014EFGH');
>    alias   | token | lexemes
> -----------+-------+---------
>  asciiword | ABCD  | {abcd}
>  blank     | —     |
>  asciiword | EFGH  | {efgh}
>
>
> OTOH lower casing uses LC_CTYPE in 17, but not in 18, leading
> to better lexemes.
>
> pg17, ICU locale, LC_CTYPE=C
>
> => select alias, token,lexemes from ts_debug('simple', 'ÉTÉ');
>  alias | token | lexemes
> -------+-------+---------
>  word  | ÉTÉ   | {ÉtÉ}
>
> pg18, ICU locale, LC_CTYPE=C
>
> => select alias, token,lexemes from ts_debug('simple', 'ÉTÉ');
>  alias | token | lexemes
> -------+-------+---------
>  word  | ÉTÉ   | {été}
>
> So maybe the release notes should say
> "now observes the locale provider configured for the database to
> convert strings to lower case".

Is it only used for converting to lower case, or are there any other
operations that need to be mentioned? Converting to upper case too, I
presume. (I haven't been following this thread.)

We only support two collation providers, libc and ICU, right? That makes
Peter's phrasing "In database clusters that use a locale provider other
than libc ..." an unnecessarily complicated way of saying ICU.

Putting those two changes together:

"""
The locale implementation underlying full-text search was improved. It
now observes the collation provider configured for the database for
converting strings to upper and lower case. It was previously hardcoded
to use libc. In databases that use the ICU collation provider and where
the configured ICU locale behaves differently from the LC_CTYPE setting
configured for the database, this could cause changes in behavior of
some functions related to full-text search as well as the pg_trgm
extension. When upgrading such database clusters using pg_upgrade, it
is recommended to reindex all indexes related to full-text search and
pg_trgm after the upgrade.
"""

I wonder if it's clear enough that this applies to full-text search, not
upper/lower case conversions in general. (Is that true?)
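
For anyone testing upgrades against the draft text above, the two queries
below are a rough sketch of how the affected databases and indexes might be
found. They are not part of the proposed release notes. The assumptions are:
PostgreSQL 17 or later for the datlocale column name, and indexes built with
the stock operator classes tsvector_ops, gin_trgm_ops and gist_trgm_ops.
Indexes that use other operator classes would not be listed.

```sql
-- Sketch only: list databases that use the ICU provider, so that the ICU
-- locale can be compared by eye with the LC_CTYPE setting.
-- Assumption: PostgreSQL 17 or later, where the column is named datlocale
-- (it was daticulocale in 15 and 16).
SELECT datname, datctype, datlocale
FROM pg_database
WHERE datlocprovider = 'i';

-- Sketch only: list indexes in the current database that use a
-- full-text search or pg_trgm operator class. Run it in each database.
-- Assumption: only the stock operator class names are matched.
SELECT DISTINCT
       i.indrelid::regclass::text   AS table_name,
       i.indexrelid::regclass::text AS index_name,
       opc.opcname                  AS operator_class
FROM pg_index AS i
CROSS JOIN LATERAL unnest(i.indclass::oid[]) AS cls(opcoid)
JOIN pg_opclass AS opc ON opc.oid = cls.opcoid
WHERE opc.opcname IN ('tsvector_ops', 'gin_trgm_ops', 'gist_trgm_ops')
ORDER BY 1, 2;
```

Each index reported by the second query could then be rebuilt with
REINDEX INDEX after pg_upgrade has finished.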

It's pretty urgent to get the release notes in shape; people are testing
upgrades with the betas already...

- Heikki
