Re: pg_collation.collversion for C.UTF-8

From: "Daniel Verite" <daniel(at)manitou-mail(dot)org>
To: "Thomas Munro" <thomas(dot)munro(at)gmail(dot)com>
Cc: "Jeff Davis" <pgsql(at)j-davis(dot)com>,pgsql-hackers(at)postgresql(dot)org
Subject: Re: pg_collation.collversion for C.UTF-8
Date: 2023-04-22 17:22:24
Message-ID: ac61fb5a-461a-4bdf-9201-68fa67b6242b@manitou-mail.org
Lists: pgsql-hackers

Thomas Munro wrote:

> It looks like for technical reasons
> inside glibc, that couldn't be done before 2.35:
>
> https://sourceware.org/bugzilla/show_bug.cgi?id=17318
>
> That strengthens my opinion that C.UTF-8 (the real C.UTF-8 supplied
> by the glibc project) isn't supposed to be versioned, but it's
> extremely unfortunate that a bunch of OSes (Debian and maybe more)
> have been sorting text in some other order under that name for
> years.

Yes. This is consistent with the Debian/Ubuntu patches to
glibc/localedata/locales/C.

glibc-2.35 is not patched, and upstream has this:
LC_COLLATE
% The keyword 'codepoint_collation' in any part of any LC_COLLATE
% immediately discards all collation information and causes the
% locale to use strcmp/wcscmp for collation comparison. This is
% exactly what is needed for C (ASCII) or C.UTF-8.
codepoint_collation
END LC_COLLATE
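
As a rough illustration of why strcmp is enough for C.UTF-8: UTF-8
preserves code point order, so comparing the encoded bytes gives the
same result as comparing code points. The sketch below is not taken
from glibc; the sample strings and helper names are made up for the
example.

```python
# Illustrative only: shows that byte-wise order of UTF-8 strings
# (what strcmp sees) matches code point order (what wcscmp sees
# when wchar_t holds full code points).

def bytewise_key(s):
    # Sort key equivalent to comparing the UTF-8 bytes, as strcmp would.
    return s.encode("utf-8")

def codepoint_key(s):
    # Sort key equivalent to comparing code point by code point.
    return tuple(ord(c) for c in s)

# Hypothetical sample covering 1-, 2-, 3- and 4-byte UTF-8 sequences.
samples = ["a", "B", "\u00e9", "z", "\u20ac", "\U0001F600", "ab", "aB", ""]

by_bytes = sorted(samples, key=bytewise_key)
by_codepoints = sorted(samples, key=codepoint_key)
```

Both lists come out identical, and "B" sorts before "a" in each, which
is the plain binary order one expects from a C locale.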

But in older versions, upstream glibc doesn't have the locales/C data
file. Debian adds it in debian/patches/localedata/C with this kind of
content:

* glibc 2.31 (Debian 11)
LC_COLLATE
order_start forward
<U0000>
..
<U007F>
<U0080>
..
<U00FF>
etc...

But as explained in the above-linked bugzilla entry, that did not
result in true byte-comparison semantics, for several reasons
that got fixed in 2.35.
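
One way to probe a given system is to compare what strcoll() returns
under C.UTF-8 with plain byte order for a set of strings. The following
is a minimal sketch, assuming Python is available on the machine;
collation_mismatches() and the sample list are hypothetical, and which
pairs (if any) differ on a given glibc build is not something this
sketch asserts.

```python
import locale
from itertools import combinations

def sign(n):
    # Reduce any integer to -1, 0 or 1.
    return (n > 0) - (n < 0)

def collation_mismatches(samples, locale_name="C.UTF-8"):
    """Return pairs whose strcoll() order differs from UTF-8 byte order.

    Returns None when the requested locale is not installed.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # query current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        mismatches = []
        for a, b in combinations(samples, 2):
            ea, eb = a.encode("utf-8"), b.encode("utf-8")
            expected = (ea > eb) - (ea < eb)  # byte-wise comparison
            if sign(locale.strcoll(a, b)) != expected:
                mismatches.append((a, b))
        return mismatches
    finally:
        # Restore whatever LC_COLLATE was in effect before the probe.
        locale.setlocale(locale.LC_COLLATE, saved)

# Hypothetical sample; real testing would want a much wider range.
result = collation_mismatches(["a", "B", "\u00e9", "\u20ac", "ab"])
```

An empty list means no difference was seen for these particular
strings, which of course proves nothing about strings not tested.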

So this looks like a solved problem for anyone starting to use these
collations with glibc 2.35 or newer (or on other OSes that don't have a
compatibility issue with them in the first place).
But Debian/Ubuntu users upgrading from the older C.* to 2.35+ will not
get the usual warning about the need to reindex.

I understand that my proposal to version C.* like any other collation
might be erring on the side of caution, but ignoring these collation
changes on at least one major OS does not feel right either.
Maybe we should consider doing platform-dependent checks?
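
To make the idea concrete, here is one possible shape for such a check,
written as a sketch only. The function names and the
patched_before_235 flag are hypothetical and not part of PostgreSQL or
glibc; the 2.35 threshold is the one discussed above.

```python
def parse_version(v):
    # "2.31" -> (2, 31); anything after the minor number is ignored.
    major, minor = v.split(".")[:2]
    return int(major), int(minor)

def c_locale_order_may_have_changed(old_glibc, new_glibc,
                                    patched_before_235=True):
    """Hypothetical check: did C.* possibly change sort order?

    patched_before_235 is an assumption supplied by the caller: True on
    platforms (such as Debian/Ubuntu) that shipped their own C.UTF-8
    collation data before glibc 2.35.
    """
    threshold = (2, 35)
    crossed = parse_version(old_glibc) < threshold <= parse_version(new_glibc)
    return bool(patched_before_235 and crossed)
```

With this, an upgrade from 2.31 to 2.35 on a patched platform would be
flagged, while 2.35 to 2.36 would not.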

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
