Re: Collation version tracking for macOS

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Jeremy Schneider <schneider(at)ardentperf(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, "Nasby, Jim" <nasbyj(at)amazon(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Collation version tracking for macOS
Date: 2022-11-30 00:50:51
Message-ID: CA+hUKGJtmxV43_zjRdJxxEzpAZoQ5BUhzM2N9_Njh85oTt564g@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 30, 2022 at 1:32 PM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> On Wed, 2022-11-30 at 10:29 +1300, Thomas Munro wrote:
> > On Wed, Nov 30, 2022 at 9:59 AM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> > > Here's what I found for the 'ar' locale (firstminor/lastminor are the
> > > ICU library versions, firstcollversion/lastcollversion are their
> > > respective collation versions for the given locale):
> > >
> > >  firstminor | lastminor | firstcollversion | lastcollversion
> > > ------------+-----------+------------------+-----------------
> > >  60.1       | 60.3      | 153.80.32        | 153.80.32.1
> > >  64.1       | 64.2      | 153.96.35        | 153.97.35.8
> > >  68.1       | 68.2      | 153.14.38        | 153.14.38.8
> > > (3 rows)
> >
> > Right, this fits with what I said earlier: the third component is CLDR
> > major, and the fourth component is CLDR minor, except that from ICU 61
> > on the CLDR minor is shifted left by 3 bits (X.X.38.8 means CLDR 38.1).
>
> What about 64.1 -> 64.2? That changed the *second* component from 96 ->
> 97. Are we agreed that collations can materially change in minor ICU
> releases?

That means that the Unicode/UCA version switched from 12 to 12.1, so
that's a confirmed sighting of a UCA minor version bump within one ICU
major version. Let's see what the purpose of that Unicode minor
release was[1]:

"Unicode 12.1 adds exactly one character, for a total of 137,929 characters.

The new character added to Version 12.1 is:

U+32FF SQUARE ERA NAME REIWA

Version 12.1 adds that single character to enable software to be
rapidly updated to support the new Japanese era name in calendrical
systems and date formatting. The new Japanese era name was officially
announced on April 1, 2019, and is effective as of May 1, 2019."

Wow!

Wikipedia says[2] that, among era names, 'the "rei" character 令 has
never appeared before'.

The sort order of characters that didn't previously exist is a special
topic. In theory they can't hurt you because you shouldn't have been
using them, but PostgreSQL doesn't enforce that (other systems do), so
you could be exposed to a change from whatever default ordering the
non-existent codepoint had for random implementation reasons to some
deliberate ordering which may or may not be the same.
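
As a rough illustration of the kind of check that "other systems" could
apply, here is a minimal Python sketch that flags characters with general
category Cn (unassigned). The helper name is hypothetical, and the answer
depends on the Unicode database bundled with the Python build, not on ICU
or on anything PostgreSQL does. U+0378 is used as the sample because it is
an unassigned code point at the time of writing.

```python
import unicodedata


def unassigned_code_points(text: str) -> list:
    """Return the characters in text that this Python's Unicode database
    classifies as unassigned (general category 'Cn')."""
    return [ch for ch in text if unicodedata.category(ch) == "Cn"]


# The result is relative to this database version, so a code point that is
# unassigned here may be assigned in a newer Unicode release.
print("Unicode database version:", unicodedata.unidata_version)
print(unassigned_code_points("abc\u0378"))
```

A string that passes such a check under one Unicode version contains only
code points whose ordering was deliberately defined at that version.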

Are all Unicode/UCA minor versions of that type? I dunno. Something
to research, but [3] is far too vague and [4] is about other problems.
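
For reference, the reading of the third and fourth version components
quoted above can be sketched in Python. This only encodes the rule as
stated in this thread (third component is CLDR major; fourth is CLDR
minor, shifted left by 3 from ICU 61 on); the function name and the
treatment of a missing fourth component as 0 are assumptions made for
illustration, and it says nothing about the second component.

```python
from typing import Tuple


def cldr_from_collversion(collversion: str, icu_major: int) -> Tuple[int, int]:
    """Decode (CLDR major, CLDR minor) from an ICU collation version string,
    following the rule described in this thread."""
    parts = [int(p) for p in collversion.split(".")]
    cldr_major = parts[2]
    # Assumption: a missing fourth component means minor version 0.
    raw_minor = parts[3] if len(parts) > 3 else 0
    # From ICU 61 on, the CLDR minor is stored shifted left by 3 bits.
    cldr_minor = raw_minor >> 3 if icu_major >= 61 else raw_minor
    return cldr_major, cldr_minor


# Sample inputs are the lastcollversion values from the table quoted above.
for icu_major, version in [(60, "153.80.32.1"), (64, "153.97.35.8"), (68, "153.14.38.8")]:
    print(icu_major, version, cldr_from_collversion(version, icu_major))
```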

[1] https://unicode.org/versions/Unicode12.1.0/
[2] https://en.wikipedia.org/wiki/Reiwa
[3] https://www.unicode.org/versions/#major_minor
[4] https://www.unicode.org/policies/stability_policy.html
