Re: Collation version tracking for macOS

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Jeremy Schneider <schneider(at)ardentperf(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, "Nasby, Jim" <nasbyj(at)amazon(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Collation version tracking for macOS
Date: 2022-12-05 03:12:07
Message-ID: CA+hUKGKWX6q_UXsDhVbhYns2+VsrPLW5o6ohqO6C=cZ6r23G7w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Nov 29, 2022 at 7:51 PM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> On Sat, 2022-11-26 at 18:27 +1300, Thomas Munro wrote:
> > On Thu, Nov 24, 2022 at 5:48 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
> > wrote:
> > > On Thu, Nov 24, 2022 at 3:07 PM Jeff Davis <pgsql(at)j-davis(dot)com>
> > > wrote:
> > > > I'd vote for 1 on the grounds that it's easier to document and
> > > > understand a single collation version, which comes straight from
> > > > ucol_getVersion(). This approach makes it a separate problem to
> > > > find
> > > > the collation version among whatever libraries the admin can
> > > > provide;
> > > > but adding some observability into the search should mitigate any
> > > > confusion.
> > >
> > > OK, it sounds like I should code that up next.
> >
> > Here's the first iteration.
>
> Thank you.

Thanks for the review. Responses further down. And thanks also for
the really interesting discussion about how the version numbers work
(or in some cases, don't work...), and practical packaging and linking
problems.

To have a hope of making something happen for PG16, which I think
means we need a serious contender patch in the next few weeks, we
really need to make some decisions. I enjoyed trying out
search-by-collversion, but it's still not my favourite. On the ballot
we have two main questions:

1. Should we commit to search-by-collversion, or one of the explicit
library version ideas, and if the latter, which?
2. Should we try to support being specific about minor versions (in
various different ways according to the choice made for #1)?

My tentative votes are:

1. I think we should seriously consider provider = ICU63. I still
think search-by-collversion is a little too magical, even though it
clearly can be made to work. Of the non-magical systems, I think
encoding the choice of library into the provider name would avoid the
need to add a second confusing "X_version" concept alongside our
existing "X_version" columns in catalogues and DDL syntax, while still
making it super clear what is going on. This would include adding DDL
commands so you can do ALTER DATABASE/COLLATION ... PROVIDER = ICU63
to make warnings go way.

2. I think we should ignore minor versions for now (other than
reporting them in the relevant introspection functions), but not make
any choices that would prevent us from changing our mind about that in
a later release. For example, having two levels of specificity ICU
and ICU68 in the libver-in-provider-name design wouldn't preclude us
from adding support for ICU68_2 later

I haven't actually tried that design out in code yet, but I'm willing
to try to code that up very soon. So no new patch from me yet. Does
anyone else want to express a view?

> Proposed changes:
>
> * I attached a first pass of some documentation.

Thanks. Looks pretty good, and much of it would stay if we changed to
one of the other models.

> * Should be another GUC to turn WARNING into an ERROR. Useful at least
> for testing; perhaps too dangerous for production.

OK, will add that into the next version.

> * The libraries should be loaded in a more diliberate order. The "*"
> should be expanded in a descending fashion so that later versions are
> preferred.

Yeah, I agree.

> * GUCs should be validated.

Will do.

> * Should validate that loaded library has expected version.

Will do.

> * We need to revise or remove pg_collation_actual_version() and
> pg_database_collation_actual_version().

I never liked that use of the word "actual"...

> * The GUCs are PGC_SUSET, but don't take effect because
> icu_library_list_fully_loaded is never reset.

True. Just rought edges because I was trying to prototype
search-by-collversion fast. Will consider this for the next version.

> * The extra collations you're adding at bootstrap time are named based
> on the library major version. I suppose it might be more "proper" to
> name them based on the collation version, but that would be more
> verbose, so I won't advocate for that. Just pointing it out.

Ah, yes, the ones with names like "en-US-x-icu68". I agree that made
a little less sense in the search-by-collversion patch. Maybe we
wouldn't want these at all in the search-by-collversion model. But I
think they're perfect the way they are in the provider = ICU68 model.
The other idea I considered ages ago was that we could use namespaces:
you could "icu68.en-US", or just "en-US" in some contexts to get what
your search path sees, but that all seemed a little too cute and not
really like anything else we do with system-created catalogues, so I
gave that idea up.

> * It looks hard (or impossible) to mix multiple ICU libraries with the
> same major version and different minor versions. That's because,
> e.g., libicui18n.so.63.1 links against libicuuc.63 and libicudata.63,
> and when you install ICU 63.2, those dependencies get clobbered with
> the 63.2 versions. That fails the sanity check I proposed above about
> the library version number matching the requested library version
> number. And it also just seems wrong -- why would you have minor-
> version precision about an ICU library but then only major-version
> precision about the ICU dependencies of that library? Doesn't that
> defeat the whole purpose of this naming scheme? (Maybe another ICU
> bug?).

I don't think it's a bug exactly. That scheme is designed to
advertise ABI stability, and not intended to support parallel
installation of minor versions. It does seem a little silly for
libraries that are shipped together as one atomic unit not to use
fully qualified dependency names, though.

I think there would be various technical solutions, if you're prepared
to give up existing ready-made packages and build stuff yourself.
Install them into different directories with different DT_RPATH so
they can't see each other (but then our icu_library_path needs to
support a list of paths or it won't find these ones which will have to
be not in the usual system path), or clobber the DT_NEEDED (but I
guess not the DT_SONAME) to mention the minor version, and equivalent
concepts for other non-elf systems (at a glance the same problem
applies on macOS), or re-roll the libraries into a single .so. Or
convince them to support a single library build mode (maybe there is
one already? I couldn't find it).

That's all a bit against the grain for now, and makes me want to
abandon the notion of minor versions completely for now but leave the
option open for later exploration.

In the meantime, I think the feature is still pretty useful. For
example, it helps you with the common case of a major OS upgrade or
streaming replication across major OS versions: just find the right
.deb/rpm/whatever for the older one, and install it, until you're
ready to upgrade and REFRESH. The story is not quite as good for
someone with an index full of Chinese or Turkish text who gets a
surprise warning after a minor apt-get update, because the Japanese
have decided to invent a new character. We can't offer a nice
solution to that: they have to determine that it is safe to REFRESH to
clear the warning, with or without rebuild, or downgrade/pin the ICU
package until they are ready to REFRESH. But that is already the case
today and this patch neither helps nor hinders. The only reason we
didn't know about this pre-existing type of problem is because
(approximately) nobody uses ICU yet, because it wasn't available as a
database default yet.

> Minor comments:
>
> * ICU_I18N is defined in make_icu_library_name() but used outside of
> it. One solution might be to have it return both library names to the
> caller and rename it as make_icu_library_names().

Good idea, will do.

> * get_icu_function() could use a clarifying comment or a better name.
> Something that communicates that you are looking for the function in
> the given library with the given major version number (which may or may
> not be needed depending on how the library was compiled).

Agreed.

> * typo in comment over make_icu_collator:
> s/u_getVersion/ucol_getVersion/

Thanks.

> * The return value of make_icu_collator() seems backwards to me,
> stylistically. I typically see the false-is-good pattern with integer
> returns.

Agreed.

> * weird bracketing style in get_icu_collator for the "else"

Yep.

> > The version rosetta stone functions look like this:
> >
> > postgres=# select * from pg_icu_library_versions();
> > icu_version | unicode_version | cldr_version
> > -------------+-----------------+--------------
> > 67.1 | 13.0 | 37.0
> > 63.1 | 11.0 | 34.0
> > 57.1 | 8.0 | 29.0
> > (3 rows)
> >
> > postgres=# select * from pg_icu_collation_versions('zh');
> > icu_version | uca_version | collator_version
> > -------------+-------------+------------------
> > 67.1 | 13.0 | 153.14.37
> > 63.1 | 11.0 | 153.88.34
> > 57.1 | 8.0 | 153.64.29
> > (3 rows)
>
> I like these functions.

Yeah, they've been quite educational. Now I'm wondering what form
these functions would take in the provider = ICU68 patch.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bharath Rupireddy 2022-12-05 03:18:25 Re: Add LSN along with offset to error messages reported for WAL file read/write/validate header failures
Previous Message Amit Langote 2022-12-05 03:09:27 Re: ExecRTCheckPerms() and many prunable partitions