Re: ICU locale validation / canonicalization

From: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: ICU locale validation / canonicalization
Date: 2023-03-21 09:35:56
Message-ID: b29cdc9b-1176-4456-e909-8fd64af03af8@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 17.03.23 18:55, Jeff Davis wrote:
> On Wed, 2023-03-15 at 15:18 -0700, Jeff Davis wrote:
>> I left out the validation patch for now, and I'm evaluating a
>> different
>> approach that will attempt to match to the locales retrieved with
>> uloc_countAvailable()/uloc_getAvailable().
>
> I like this approach, attached new patch series with that included as
> 0006.

I have looked at the first three patches. I think we want what those
patches do.

[PATCH v6 1/6] Support language tags in older ICU versions (53 and
earlier).

In pg_import_system_collations(), this is now redundant and can be
simplified:

- if (!pg_is_ascii(langtag) || !pg_is_ascii(iculocstr))
+ if (!pg_is_ascii(langtag) || !pg_is_ascii(langtag))

icu_set_collation_attributes() needs more commenting about what is going
on. My guess is that uloc_canonicalize() converts from language tag to
ICU locale ID, and then the existing logic to parse that apart would
apply. Is that how it works?

[PATCH v6 2/6] Wrap ICU ucol_open().

It makes sense to try to unify some of this. But I find the naming
confusing. If I see pg_ucol_open(), then I would expect that all calls
to ucol_open() would be replaced by this. But here it's only a few,
without explanation. (pg_ucol_open() has no explanation at all AFAICT.)

I have in my notes that check_icu_locale() and make_icu_collator()
should be combined into a single function. I think that would be a
better way to slice it.

Btw., I had intentionally not written code like this

+#if U_ICU_VERSION_MAJOR_NUM < 54
+ icu_set_collation_attributes(collator, loc_str);
+#endif

The disadvantage of doing it that way is that you then need to dig out
an old version of ICU in order to check whether the code compiles at
all. With the current code, you can be sure that that code compiles if
you make changes elsewhere.

[PATCH v6 3/6] Handle the "und" locale in ICU versions 54 and older.

This makes sense, but the same comment about not #if'ing out code for
old ICU versions applies here.

The

+#ifdef USE_ICU
+

before pg_ucol_open() probably belongs in patch 2.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Etsuro Fujita 2023-03-21 09:41:35 Comment in preptlist.c
Previous Message Melih Mutlu 2023-03-21 09:32:40 Re: Allow logical replication to copy tables in binary format