Re: Collation version tracking for macOS

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Jeremy Schneider <schneider(at)ardentperf(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, "Finnerty, Jim" <jfinnert(at)amazon(dot)com>, "Nasby, Jim" <nasbyj(at)amazon(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Collation version tracking for macOS
Date: 2022-11-23 05:08:30
Message-ID: CA+hUKGJxg6AbKC9RJ7r1ByVLtvVkThQV+RZO6BKVWYESPCp3Ug@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Nov 22, 2022 at 7:34 PM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> On Sat, 2022-10-22 at 14:22 +1300, Thomas Munro wrote:
> > Problem 2: If ICU 67 ever decides to report a different version for
> > a
> > given collation (would it ever do that? I don't expect so, but ...),
> > we'd be unable to open the collation with the search-by-collversion
> > design, and potentially the database. What is a user supposed to do
> > then? Presumably our error/hint for that would be "please insert the
> > correct ICU library into drive A", but now there is no correct
> > library
>
> Let's say that Postgres is compiled against version 67.X, and the
> sysadmin upgrades the ICU package to 67.Y, which reports a different
> collation version for some locale.
>
> Your current patch makes this impossible for the administrator to fix,
> because there's no way to have two different libraries loaded with the
> same major version number, so it will always pick the compiled-in ICU.
> The user will be forced to accept the new version of the collation, see
> WARNINGs in their logs, and possibly corrupt their indexes.

They could probably also 'pin' the older minor version package using
their package manager (= downgrade) until they're ready to upgrade and
use REFRESH VERSION to certify that they've rebuilt everything
relevant or are OK with risks. Not pretty I admit, but I think the
end result is about the same for search-for-collversion, because I
imagine that (1) the default behaviour on failure to search would
likely be to use the linked library instead and WARN about
[dat]collversion mismatch, so far the same, and (2) the set of people
who would really be prepared to compile their own copy of 67.X instead
of downgrading or REFRESHing (with or without rebuilding) is
vanishingly small.

Two questions I wondered about:

1. *Do* they change ucol_getVersion() values in minor releases? I
tried to find a written policy on that.
https://icu.unicode.org/processes is not encouraging: it gives the
example of a "third digit in an official release number" [changing]
because a CLDR change was incorporated. Hrmph. But that's clearly
not even the modern ICU versioning system (it made a change a bit like
ours in 49, making the first number only major, so maybe that "third"
number is now the second number, AKA minor version), and also that's a
CLDR minor version change; is CLDR minor even in the recipe for
ucol_getVersion()? Even without data changes, I guess that bug fixes
could apply to the UCA logic, and I assume that UCA logic is included
in it. Hmm.

A non-hypothetical example of a CLDR change within an ICU major
version that I've been able to find is:

https://cldr.unicode.org/index/downloads/cldr-38

Here we see that CLDR had a minor version bump 38 -> 38.1, "a very
small number of incremental additions to version 38 to address the
specific bugs listed in Δ38.1", and was included in ICU 68.2. Being a
minor ICU release 68.1 -> 68.2, perhaps you could finish up running
that just with a regular upgrade on typical distros (not a major OS
upgrade), and since PostgreSQL would normally be linked against eg
.68, not .68.1, it'd start using it at the next cluster start when
that symlink is updated to point to .68.2. As it happens, if you
follow the documentation links to see what actually changed in that
particular pair of CLDR+ICU minor releases, it's timezones and locale
stuff other than collations, so wouldn't affect us. Can we find a
chapter and verse that says that ICU would only ever move to a new
CLDR in a minor release, and CLDR would never change order of
pre-existing code points in a minor release?

It might be interesting to see if
https://github.com/unicode-org/icu/tree/release-68-1 and
https://github.com/unicode-org/icu/tree/release-68-2 report a
different ucol_getVersion() for any locale, but not conclusive if it
doesn't; it might be because something in the version pipeline knew
that particular CLDR change didn't affect collators...

This speculation feels pretty useless. Maybe we should go and read
the code or ask an ICU expert, but I'm not against making it
theoretically possible to access two different minor versions at once,
just to cover all the bases for future-proofing.

2. Would package managers ever allow two minor versions to be
installed at once? I highly doubt it; they're probably more
interested in ABI stability so that dependent packages work when
bugfixes are shipped, and that's certainly nailed down at the major
version level. It'd probably be a case of having to compile it
yourself, which seems unlikely to me in the real world. That's why I
left minor version out of earlier patches, but I'm OK with changing
that.

As for how, I think that depends on our modelling decision (see below).

> Search-by-collversion would still be frustrating for the admin, but at
> least it would be possible to fix by compiling their own 67.X and
> asking Postgres to search that library, too. We could make it slightly
> more friendly by having an error that reports the libraries searched
> and the collation versions found, if none of the versions match. We can
> have a GUC that controls whether a failure to find the right version is
> a WARNING or an ERROR.

Good ideas.

> I tried to write up some docs. It's hard to explain why we are exposing
> to the user the collation version and the library version in these
> different ways, and what effects they have.

Always a good test: see how crazy it sounds when translated to user speak.

> The current patch feels like it hasn't decided whether the collation
> version is ucol_getVersion() (collversion) or u_getVersion() (library
> version). The collversion is more prominent in the UI (with its own
> syntax), yet it's just a cross-check for whether to issue a WARNING or
> not; while the library version is hidden in the locale field and it
> actually decides which symbol is called.

Yeah. I agree that it sucks to have two kinds of versions flying
around in the user's mind.

> > Yeah. I just don't like the way it *appears* to be doing something
> > clever, but
> > it doesn't solve any fundamental problem at all because the
> > collversion
> > information is under human control and so it's really doing something
> > stupid.
>
> I assume by "human control" you mean "ALTER COLLATION ... REFRESH
> VERSION". I agree that relying on the admin's declaration is dubious,
> especially when we provide no good advice on how to actually do that
> safely.
>
> But I don't see what using the library version instead buys us here,
> except that library version is part of the LOCALE, and there's no ALTER
> command for that. You could just as easily deprecate/eliminate the
> ALTER COLLATION REFRESH VERSION, and then say that the collversion is
> out of human control, too.
>
> By introducing multiple libraries, I think we need to change that
> syntax anyway, to be something like:
>
> ALTER COLLATION ... SET VERSION TO '...'
>
> or even:
>
> ALTER COLLATION ... FORCE VERSION TO '...'

OK. Time for a new list of the various models we've discussed so far:

1. search-by-collversion: We introduce no new "library version"
concept to COLLATION and DATABASE object and little or no new syntax.
Whenever opening a collation or database, the system will search some
candidate list of ICU libraries to try to find the one that agrees
with [dat]collversion. When creating a new collation or database, the
system will select one (probably the linked one unless you override
somehow) and record ucol_getVersion() in [dat]collversion. When
searching, it might fail to find a suitable library and ereport; to
fix that, it is the admin's job to somehow expand the set of candidate
libraries. In such a failure case, perhaps it would fall back to
using some default library version (probably the one that is linked,
overridable by GUC?), with a WARNING (unless you turned on ERRORs),
and if you want to shut it up without supplying the right candidate
library, you can still fall back to the REFRESH VERSION hammer (or
maybe that should indeed called FORCE to make it clearer that it's not
a harmless operation where the system holds your hand, you're actually
certifying that you have rebuilt indexes and you know what you're
doing).

The set of candidate versions could perhaps be provided with
extra_icu_library_versions=63,71 OR =63.1,63.2 strings, at least on
Unix systems following the traditional symlink conventions.
Remembering that a typical Unixoid system should have libraries and
symlinks like:

libicui18n.a
libicui18n.so -> libicui18n.so.71.1
libicui18n.so.63 -> libicui18n.so.63.1
libicui18n.so.63.1
libicui18n.so.67 -> libicui18n.so.67.1
libicui18n.so.67.1
libicui18n.so.71 -> libicui18n.so.71.1
libicui18n.so.71.1

The reason I prefer major[.minor] strings over whole library names is
that we need to dlopen two of them so it's a little easier to build
them from those parts than have to supply both names. The reason I
prefer to keep allowing major-only versions to be listed is that it's
good to have the option to just follow minor upgrades automatically.
Or I guess you could make something that can automatically search a
whole directory (which directory?) to find all the suitably named
libraries so you don't ever have to mention versions manually (if you
want "apt-get install libicu72" to be enough with no GUC change
needed) -- is that too weird?

Perhaps we could write functions that can show you the available
versions to demystify the searching mechanism slightly and show how
various numbers relate, something like (warning: I made up numbers for
illustration, they are wrong!):

SELECT * FROM pg_available_icu_libraries()

icu_version unicode_version uca_version cldr_version
67.1 14.0 3.1 38.0
71.1 15.0 4.0 42.0

SELECT * FROM pg_available_icu_collation_versions('en')

icu_version collation_version
67.1 142.42
71.1 153.112

2. lib-version-in-providers: We introduce a separate provider value
for each ICU version, for example ICU63, plus an unversioned ICU like
today. The collversion column is used only for warnings. Warnings
are expected when you used the unversioned ICU provider and upgrade to
a binary linked to a later library. You can clear the warnings by
doing ALTER COLLATION/DATABASE SET [LOCALE_]PROVIDER = ICU63, or with
the REFRESH VERSION hammer.

Not sure how you fit minor versions into that, if we want to support
those. Maybe ICU means "whatever is linked", ICU63 means "whatever
libicui18n.so.63 points to" and ICU63_1 means libicu18n.so.63.1,
something like that, so the user can choose from three levels of
specificity.

3. lib-version-in-attributes: We introduce daticuversion (alongside
datcollversion) and collicuversion (alongside collversion). Similar
to the above, but it's a separate property and the provider is always
ICU. New syntax for CREATE/ALTER COLLATION/DATABASE to set and change
ICU_VERSION.

4. lib-version-in-locale: "63:en" from earlier versions. That was
mostly a strawman proposal to avoid getting bogged down in
syntax/catalogue/model change discussions while trying to prove that
dlopen would even work. It doesn't sound like anyone really likes
this.

5. lib-version-in-collversion: We didn't explicitly discuss this
before, but you hinted at it: we could just use u_getVersion() in
[dat]collversion. I haven't analysed this much but I don't think it
has a very nice upgrade path from PG15, and it forces you to decide
whether to store just the major version and not even notice when the
(unstored) minor version changes, or store major.minor and
complain/break down when routine minor upgrades happen. It is a
logical possibility though, once you decide you only want one kind of
version in the system.

I'm willing to update the patch to try one of these out so we can kick
the tyres some more, but I'll wait to see if we can get some consensus
on the way forward. Despite my initial reactions, I'm willing to try
out the search-by-collversion concept if others are keen on it. The
example I worked through in the first paragraph of this email helped
me warm to it a little, and with the observability functions I showed
you might have a chance of figuring out what's going on in some edge
cases. Any other ideas, or votes for these ideas?

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Justin Pryzby 2022-11-23 05:43:29 Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
Previous Message John Naylor 2022-11-23 04:35:17 Re: Non-decimal integer literals