Re: ICU locale validation / canonicalization

From: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: ICU locale validation / canonicalization
Date: 2023-03-09 08:46:46
Message-ID: 899ab44a-4307-064f-0945-412723d57c02@enterprisedb.com
Lists: pgsql-hackers

On 28.02.23 06:57, Jeff Davis wrote:
> On Mon, 2023-02-20 at 15:23 -0800, Jeff Davis wrote:
>>
>> New patch attached. The new patch also includes a GUC that (when
>> enabled) validates that the collator is actually found.
>
> New patch attached.
>
> Now it always preserves the exact locale string during pg_upgrade, and
> does not attempt to canonicalize it. Before it was trying to be clever
> by determining if the language tag was finding the same collator as the
> original string -- I didn't find a problem with that, but it just
> seemed a bit too clever. So, only newly-created locales and databases
> have the ICU locale string canonicalized to a language tag.
>
> Also, I added a SQL function pg_icu_language_tag() that can convert
> locale strings to language tags, and check whether they exist or not.

This patch appears to do about three things at once, and it's not clear
exactly where the boundaries are between them and which ones we might
actually want. And I think the terminology also gets mixed up a bit,
which makes following this harder.

1. Canonicalizing the locale string. This is presumably what
uloc_canonicalize() does, which the patch doesn't actually use. What
are examples of what this does? Does the patch actually do this?

2. Converting the locale string to BCP 47 format. This converts
'de@collation=phonebook' to 'de-u-co-phonebk'. This is what
uloc_toLanguageTag() does.

3. Validating the locale string, to reject faulty input.

What are the relationships between these?

I don't understand how the validation actually happens in your patch.
Does uloc_toLanguageTag() do the validation also?

Can you do canonicalization without converting to language tag?

Can you do validation of un-canonicalized locale names?

What is the guidance for the use of the icu_locale_validation GUC?

The description throws in yet another term: "validates that ICU locale
strings are well-formed". What is "well-formed"? How does that relate
to the other concepts?

Personally, I'm not on board with this behavior:

=> CREATE COLLATION test (provider = icu, locale =
'de@collation=phonebook');
NOTICE: 00000: using language tag "de-u-co-phonebk" for locale
"de@collation=phonebook"

I mean, maybe that is a thing we want to do somehow sometime, to migrate
people to the "new" spellings, but the old ones aren't wrong. So this
should be a separate consideration, with an option, and it would require
various updates in the documentation. It also doesn't appear to address
how to handle ICU before version 54.

But, see earlier questions, are these three things all connected somehow?
