Re: pg_collation.collversion for C.UTF-8

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: pg_collation.collversion for C.UTF-8
Date: 2023-05-26 17:43:09
Message-ID: 56ef55fc2212334e1f72b3d8128106e9ab37fe5a.camel@j-davis.com
Lists: pgsql-hackers

On Thu, 2023-05-25 at 14:48 -0400, Tom Lane wrote:
> Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> > What should we do with locales like C.UTF-8 in both libc and ICU?
>
> I vote for passing those to the existing C-specific code paths,

Great, this would be a big step toward solving the ICU usability issues
in this thread:

https://postgr.es/m/000b01d97465%24c34bbd60%2449e33820%24%40pcorp.us

> Probably "C", or "C.anything", or "POSIX", or "POSIX.anything".
> Case-independent might be good, but we haven't accepted such in
> the past, so I don't feel strongly about it.  (Arguably, passing
> lower case "c" to the provider would provide an "out" to anybody
> who dislikes our choices here.)

Patch attached with your suggestions. It's based on the first patch in
the series I posted here:

https://postgr.es/m/a4388fa3acabf7794ac39fdb471ad97eebdfbe11.camel@j-davis.com
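
For readers following along, the name-matching rule suggested above can
be sketched roughly as below. This is only an illustration and is not
taken from the attached patch: the function name is_c_locale_name is
hypothetical, and the match is assumed to be case-sensitive, so that a
lower-case "c" still reaches the provider.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper, not the function from the attached patch.
 *
 * Returns true if a locale name should be routed to the C-specific code
 * paths: "C", "C.anything", "POSIX", or "POSIX.anything".  The comparison
 * is case-sensitive on purpose, so "c" is still passed to the provider.
 */
bool
is_c_locale_name(const char *name)
{
    static const char *const bases[] = {"C", "POSIX"};

    for (size_t i = 0; i < sizeof(bases) / sizeof(bases[0]); i++)
    {
        size_t      len = strlen(bases[i]);

        /* The prefix must match and be followed by end-of-string or '.' */
        if (strncmp(name, bases[i], len) == 0 &&
            (name[len] == '\0' || name[len] == '.'))
            return true;
    }
    return false;
}
```

Under this sketch "C.UTF-8" and "POSIX" match, while "c", "en_US.UTF-8",
and "Cs_CZ" do not.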

We still need to consider backwards compatibility. If someone has a
collation with locale name C.UTF-8 in an earlier version, any change to
the interpretation of that locale name after an upgrade carries a
corruption risk. The risks are different in ICU vs libc:

For ICU: iculocale=C in an earlier version was a mistake that must
have been explicitly requested by the user. However, if such a mistake
was made, the indexes would have been created using the ICU root
locale, which is very different from the C locale. So reinterpreting
iculocale=C as memcmp() would be likely to result in index corruption.
Patch 0002 (also based on a patch from the series linked above) solves
this with a pg_upgrade check for iculocale=C in versions 15 and
earlier. The upgrade check is not likely to affect many users, and
those it does affect have a mis-defined collation and would benefit
from the check.

For libc: this change may affect any user who happened to have
LANG=C.UTF-8 in their environment at initdb time, which is probably a
lot of users, and some buildfarm members. However, the risk for any
given user seems to be much lower, because we've gone a long time with
the assumption that C.UTF-8 has the same behavior as C, and the
question only recently came up. Also, even where the two differ, I'm
not sure how obscure the affected cases are; perhaps they don't often
occur in practice? It's not clear to me how we could mitigate this risk
further, though.

Regards,
Jeff Davis

Attachment Content-Type Size
0001-Interpret-C-locales-consistently-between-ICU-and-lib.patch text/x-patch 17.8 KB
0002-pg_upgrade-check-for-ICU-locale-C-in-versions-15-and.patch text/x-patch 4.8 KB