Re: ICU for global collation

From: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Julien Rouhaud <rjuju123(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Verite <daniel(at)manitou-mail(dot)org>
Subject: Re: ICU for global collation
Date: 2022-03-16 14:25:09
Message-ID: 07878ad1-d94d-5a92-565f-c0dfdea8b61b@enterprisedb.com
Lists: pgsql-hackers

On 15.03.22 18:28, Robert Haas wrote:
> On Tue, Mar 15, 2022 at 12:58 PM Peter Eisentraut
> <peter(dot)eisentraut(at)enterprisedb(dot)com> wrote:
>> On 14.03.22 19:57, Robert Haas wrote:
>>> 1. What will happen if I set the ICU collation to something that
>>> doesn't match the libc collation? How bad are the consequences?
>>
>> These are unrelated, so there are no consequences.
>
> Can you please elaborate on this?

The code that is aware of ICU generally works like this:

    if (locale_provider == ICU)
        result = call ICU code
    else
        result = call libc code
    return result

However, there is code out there, both within PostgreSQL itself and in
extensions, that does not do that yet. Ideally, we would eventually
convert all of it, but that is not happening now. So we ought to
preserve the ability to set the libc locale to keep that legacy code
working for now.

This legacy code by definition doesn't know about ICU, so it doesn't
care whether the ICU setting "matches" the libc setting or anything like
that. It will just do its thing depending on its own setting.
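
To make the two code paths concrete, here is a minimal C sketch of a
provider-aware routine next to a legacy routine that only knows libc.
Everything named here (LocaleProvider, compare_strings, legacy_compare,
icu_compare_standin) is hypothetical and is not PostgreSQL's actual
code. The ICU branch is a byte-wise stand-in so that the sketch builds
without linking against ICU; real code would call an ICU collator there.

```c
#include <string.h>

/* Hypothetical provider tag for this sketch; not PostgreSQL's real type. */
typedef enum { PROVIDER_LIBC, PROVIDER_ICU } LocaleProvider;

/*
 * Stand-in for an ICU collation call. It compares bytes with strcmp()
 * only so that the sketch compiles without the ICU library.
 */
static int
icu_compare_standin(const char *a, const char *b)
{
    return strcmp(a, b);
}

/* ICU-aware code: dispatches on the configured provider. */
static int
compare_strings(LocaleProvider provider, const char *a, const char *b)
{
    int result;

    if (provider == PROVIDER_ICU)
        result = icu_compare_standin(a, b);
    else
        result = strcoll(a, b);     /* uses the current libc LC_COLLATE */
    return result;
}

/*
 * Legacy code: never looks at the provider and always goes through libc,
 * so it follows the libc setting no matter what the ICU setting says.
 */
static int
legacy_compare(const char *a, const char *b)
{
    return strcoll(a, b);
}
```

Note that legacy_compare() has no way to see the ICU setting at all,
which is why a mismatch cannot break it; it can only make its results
differ from those of compare_strings().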

The only consequence of settings that don't match is that the different
pieces of code behave semantically inconsistently (e.g., some routine
thinks the data is Greek and other code thinks the data is French). But
it's up to the user to set that correctly. And the scenarios where
you can actually do anything semantically relevant this way are very
limited.

A second point is that the LC_CTYPE setting tells other parts of libc
what the current encoding is. This affects gettext for example. So you
need to set this to something sensible even if you don't use libc locale
routines otherwise.
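
As a rough illustration of how LC_CTYPE determines the encoding that
libc reports, the following C sketch sets LC_CTYPE and asks libc for the
resulting codeset. It assumes a POSIX system (nl_langinfo() is not
available on Windows), and codeset_for_ctype is a hypothetical helper
name used only for this example.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/*
 * Set LC_CTYPE to the given locale name and return the codeset name that
 * libc now reports, or NULL if libc rejects the locale name.
 * The returned string is owned by libc and may be overwritten by later
 * setlocale() or nl_langinfo() calls.
 */
static const char *
codeset_for_ctype(const char *ctype_name)
{
    if (setlocale(LC_CTYPE, ctype_name) == NULL)
        return NULL;
    return nl_langinfo(CODESET);
}
```

Calling codeset_for_ctype("C") typically reports an ASCII-type codeset,
while a name such as "en_US.UTF-8" reports UTF-8 where that locale is
installed. Other libc facilities consult the same setting.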

>>> 2. If I want to avoid a mismatch between the two, then I will need a
>>> way to figure out which libc collation corresponds to a given ICU
>>> collation. How do I do that?
>>
>> You can specify the same name for both.
>
> Hmm. If every name were valid in both systems, I don't think you'd be
> proposing two fields.

Earlier versions of this patch and predecessor patches indeed had common
fields. But in fact the two systems accept different values once you
delve into the advanced features. For basic usage, something like
"en_US" will work for both.
