From: | Jeff Davis <pgsql(at)j-davis(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: pg_collation.collversion for C.UTF-8 |
Date: | 2023-05-26 17:43:09 |
Message-ID: | 56ef55fc2212334e1f72b3d8128106e9ab37fe5a.camel@j-davis.com |
Lists: | pgsql-hackers |
On Thu, 2023-05-25 at 14:48 -0400, Tom Lane wrote:
> Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> > What should we do with locales like C.UTF-8 in both libc and ICU?
>
> I vote for passing those to the existing C-specific code paths,
Great, this would be a big step toward solving the ICU usability issues
in this thread:
https://postgr.es/m/000b01d97465%24c34bbd60%2449e33820%24%40pcorp.us
> Probably "C", or "C.anything", or "POSIX", or "POSIX.anything".
> Case-independent might be good, but we haven't accepted such in
> the past, so I don't feel strongly about it. (Arguably, passing
> lower case "c" to the provider would provide an "out" to anybody
> who dislikes our choices here.)
Patch attached with your suggestions. It's based on the first patch in
the series I posted here:
https://postgr.es/m/a4388fa3acabf7794ac39fdb471ad97eebdfbe11.camel@j-davis.com
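As a rough illustration of the name-matching rule quoted above ("C", "C.anything", "POSIX", "POSIX.anything", case-sensitive), a minimal sketch might look like the following. This is not code from the attached patch; the function names `has_name_or_dot_suffix` and `locale_name_is_c` are hypothetical, and treating a bare trailing dot (such as "C.") as a match is an assumption made here for simplicity.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch): true if "locale" is exactly
 * "name", or "name" followed by a '.' and any suffix (e.g. "C.UTF-8").
 * The comparison is case-sensitive on purpose, so a lower-case "c"
 * does not match and would be passed through to the provider.
 */
static bool
has_name_or_dot_suffix(const char *locale, const char *name)
{
    size_t len = strlen(name);

    if (strncmp(locale, name, len) != 0)
        return false;

    /* Accept an exact match or a '.'-introduced suffix. */
    return locale[len] == '\0' || locale[len] == '.';
}

/*
 * Hypothetical entry point: should this locale name be routed to the
 * C-specific (memcmp-based) code paths rather than to libc or ICU?
 */
bool
locale_name_is_c(const char *locale)
{
    if (locale == NULL)
        return false;

    return has_name_or_dot_suffix(locale, "C") ||
           has_name_or_dot_suffix(locale, "POSIX");
}
```

Under these assumptions, `locale_name_is_c("C.UTF-8")` and `locale_name_is_c("POSIX")` return true, while `locale_name_is_c("c")` and `locale_name_is_c("en_US.UTF-8")` return false.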
We still need to consider backwards compatibility. If someone has a
collation with locale name C.UTF-8 in an earlier version, any change to
the interpretation of that locale name after an upgrade carries a
corruption risk. The risks are different in ICU vs libc:
For ICU: iculocale=C in an earlier version was a mistake that must
have been explicitly requested by the user. However, if such a mistake
was made, the indexes would have been created using the ICU root
locale, which is very different from the C locale. So reinterpreting
iculocale=C as memcmp() would be likely to result in index corruption.
Patch 0002 (also based on a patch from the series linked above) solves
this with a pg_upgrade check for iculocale=C in versions 15 and
earlier. The upgrade check is not likely to affect many users, and
those it does affect have a mis-defined collation and would benefit
from the check.
For libc: this change may affect any user who happened to have
LANG=C.UTF-8 in their environment at initdb time, which is probably a
lot of users, and some buildfarm members. However, the risk in any
given case seems to be much lower, because we've gone a long time with
the assumption that C.UTF-8 has the same behavior as C, and the question
only recently came up. Also, even if there is a difference, I'm not sure
how obscure the affected cases are; perhaps they don't often occur in
practice? It's not clear to me how we could mitigate this risk further,
though.
Regards,
Jeff Davis
Attachment | Content-Type | Size |
---|---|---|
0001-Interpret-C-locales-consistently-between-ICU-and-lib.patch | text/x-patch | 17.8 KB |
0002-pg_upgrade-check-for-ICU-locale-C-in-versions-15-and.patch | text/x-patch | 4.8 KB |