Re: ICU locale validation / canonicalization

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: ICU locale validation / canonicalization
Date: 2023-03-24 23:28:24
Message-ID: 2a9013df9a197aa5eb2d85b5f7ec0c5a726b6a2e.camel@j-davis.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, 2023-03-24 at 10:10 +0100, Peter Eisentraut wrote:
> Couldn't we do this in a simpler way by just freeing the collator
> before
> the ereport() calls.

I committed a tiny patch to do this.

We still need to address the error inconsistency though. The problem is
that, in older ICU versions, if the fixup for "und(at)colNumeric=lower" ->
"root(at)colNumeric=lower" is applied, then icu_set_collation_attributes()
will throw an error reporting "root(at)colNumeric=lower", which is not
what the user typed.

We could fix that directly by passing the original string to
icu_set_collation_attributes() instead, or perhaps as an extra
parameter used only for the ereport().

I like the minor refactoring I did better, though. It puts the
ereports() close to each other, so any differences are more obvious.
And it seems cleaner to me for pg_ucol_open to close the UCollator
because it's the one that opened it. I don't have a strong opinion, but
that's my reasoning.

>   Or wrap a PG_TRY/PG_FINALLY around the whole thing?

I generally avoid PG_TRY/FINALLY unless it avoids some major
awkwardness or other problem.

> It would be nicer to not make the callers of
> icu_set_collation_attributes() responsible for catching and reporting
> the errors.

There's only one caller now: pg_ucol_open().

> [PATCH v8 2/4] initdb: emit message when using default ICU locale.
>
> I'm not able to make initdb print this message.  Under what
> circumstances am I supposed to see this?  Do you have some examples?

It happens when you don't specify --icu-locale. It is slightly
redundant with "ICU locale", but it lets you see that it came from the
environment rather than the command line:

-------------
$ initdb -D data
The files belonging to this database system will be owned by user
"someone".
This user must also own the server process.

Using default ICU locale "en_US_POSIX".
The database cluster will be initialized with this locale
configuration:
provider: icu
ICU locale: en_US_POSIX
...
-------------

That seems fairly useful for testing, etc., where initdb.log doesn't
show the command line options.

> The function check_icu_locale() has now gotten a lot more
> functionality
> than its name suggests.  Maybe the part that assigns the default ICU
> locale should be moved up one level to setlocales(), which has a
> better
> name and does something similar for the libc locale realm.

Agreed, done.

In fact, initdb.c:check_icu_locale() is completely unnecessary in that
patch, because as the comment points out, the backend will try to open
it during post-bootstrap initialization. I think it was simply a
mistake to try to do this validation in commit 27b62377b4.

The later validation patch does do some better validation at initdb
time to make sure the language can be found.

> [PATCH v8 3/4] Canonicalize ICU locale names to language tags.
>
> I'm still on the fence about whether we actually want to do this, but
> I'm warming up to it, now that the issues with pre-54 versions are
> fixed.
>
> But if we do this, the documentation needs to be updated.  There is a
> bunch of text there that says, like, you can do this format or that
> format, whatever you like.  At least the guidance should be changed
> there.
>
>
> [PATCH v8 4/4] Validate ICU locales.
>
> I would make icu_locale_validation true by default.

Agreed. I considered also not having a GUC, but it seems like some kind
of escape hatch is wise, at least for now.

> Or maybe it should be a log-level type option, so you can set it to
> error, warning, and also completely off?

As the validation patch seems closer to acceptance, I changed it to be
before the canonicalization patch. New series attached.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachment Content-Type Size
v9-0001-Fix-error-inconsistency-in-older-ICU-versions.patch text/x-patch 4.8 KB
v9-0002-initdb-replace-check_icu_locale-with-default_icu_.patch text/x-patch 3.4 KB
v9-0003-initdb-emit-message-when-using-default-ICU-locale.patch text/x-patch 1004 bytes
v9-0004-Validate-ICU-locales.patch text/x-patch 15.1 KB
v9-0005-Canonicalize-ICU-locale-names-to-language-tags.patch text/x-patch 22.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Davis 2023-03-25 00:02:18 Re: running logical replication as the subscription owner
Previous Message Anton Kirilov 2023-03-24 22:38:48 Add PQsendSyncMessage() to libpq