Re: ICU locale validation / canonicalization

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: ICU locale validation / canonicalization
Date: 2023-02-20 23:23:40
Message-ID: 8c7af6820aed94dc7bc259d2aa7f9663518e6137.camel@j-davis.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


New patch attached. The new patch also includes a GUC that (when
enabled) validates that the collator is actually found.

On Mon, 2023-02-20 at 15:46 +0100, Peter Eisentraut wrote:
> a) BCP47 tags are preferred, and

Agreed.

> b) They don't work with ICU versions before 54.

I tried in versions 50 through 53, and the language tags are supported,
but I think I know why we don't use them:

Prior to version 54, ICU would not set the collator attributes based on
the locale name. That is the same for either language tags or ICU
format locale IDs. However, for ICU format locale IDs, we added special
code to parse the locale string and set the attributes ourselves. We
didn't bother to add the same parsing logic for language tags, so if a
language tag is found in the catalog, the parts of it that specify
collation strength (for example) would be ignored. I don't know if
that's an actual problem when importing the system collations, because
I don't think we use any collator attributes, but it makes sense that
we'd not favor language tags in ICU prior to v54.

> I would support transitioning this forward somehow, but we would need
> to
> know exactly what the impact would be.

I've done quite a bit of investigation, which I've described upthread.

We need to transition somehow, because the prior behavior is incorrect
for locales like "fr_CA.UTF-8". Our tests suggest that's an acceptable
thing to do, but if we pass that straight to ucol_open(), then it gets
misinterpreted as plain "fr" because it doesn't understand the "." as a
valid separator. We must turn it into a language tag (or at least
canonicalize it) before passing the string to ucol_open().

This misbehavior only affects a small number of locales, which resolve
to a different actual collator than they should. The most problematic
case is during pg_upgrade, where a slight behavior change would result
in corrupt indexes. So during binary upgrade, my patch falls back to
the original raw string (not the language tag) when it resolves to a
different actual collator. If we want to be more paranoid, we could
also provide a compatibility GUC to preserve the old misbehavior for
newly-created collations, too, but I don't think that's necessary.

There is also some interaction with pg_upgrade's ability to check
whether the old and new cluster are compatible. If the catalog
representation of the locale changes, then it could falsely believe the
icu locales aren't compatible, because it's doing a simple string
comparison. But as we are discussing in the other thread[1], the whole
idea of checking for compatibility of the initialized cluster is
strange: pg_upgrade should be in charge of making a compatible cluster
to upgrade into (assuming the binaries are at least compatible). I
don't see this as a major problem; we'll sort out the other thread
first to allow ICU as the default, and then adapt this patch if
necessary.

[1]
https://www.postgresql.org/message-id/20230214175957.idkb7shsqzp5nbll@awork3.anarazel.de

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachment Content-Type Size
v2-0001-ICU-locale-string-canonicalization-and-validation.patch text/x-patch 24.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2023-02-20 23:35:55 Disable rdns for Kerberos tests
Previous Message Stephen Frost 2023-02-20 23:17:21 Re: Weird failure with latches in curculio on v15[