Re: ICU for global collation

From: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>
To: Julien Rouhaud <rjuju123(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Verite <daniel(at)manitou-mail(dot)org>
Subject: Re: ICU for global collation
Date: 2022-03-14 12:50:50
Message-ID: 42ce518d-1426-fa9e-1fcc-de090cd55331@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 05.03.22 09:38, Julien Rouhaud wrote:
> I say it works because I did manually check, as far as I can see there isn't
> any test that ensures it.
>
> I'm using this naive scenario:
>
> DROP DATABASE IF EXISTS dbicu;
> CREATE DATABASE dbicu LOCALE_PROVIDER icu LOCALE 'en_US' ICU_LOCALE 'en-u-kf-upper' template 'template0';
> \c dbicu
> CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper');
> CREATE TABLE icu(def text, en text COLLATE "en_US", upfirst text COLLATE upperfirst);
> INSERT INTO icu VALUES ('a', 'a', 'a'), ('b','b','b'), ('A','A','A'), ('B','B','B');
> SELECT def AS def FROM icu ORDER BY def;
> SELECT def AS en FROM icu ORDER BY en;
> SELECT def AS upfirst FROM icu ORDER BY upfirst;
> SELECT def AS upfirst_explicit FROM icu ORDER BY en COLLATE upperfirst;
> SELECT def AS en_x_explicit FROM icu ORDER BY def COLLATE "en-x-icu";
>
> Maybe there should be some test along those lines included in the patch?

I added something like this to a new test suite under src/test/icu/.
(src/test/locale/ was already used for something else.)

> Also, it's not even used consistently in the patch. I can see e.g. "ICU
> locale" or "ICU locale setting" being used:

I have fixed those.

> Maybe we could point to the ICU documentation for a clear notion of what a
> locale ID is, and/or use their own short definition [1]:
>
>> The locale object in ICU is an identifier that specifies a particular locale
>> and has fields for language, country, and an optional code to specify further
>> variants or subdivisions. These fields also can be represented as a string
>> with the fields separated by an underscore.

I think the Localization chapter needs to be reorganized a bit, but I'll
leave that for a separate patch.

> I spent some time looking at the ICU api trying to figure out if using a
> posix locale name (e.g. en_US) was actually compatible with an ICU locale name.
> It seems that ICU accept any of 'en-us', 'en-US', 'en_us' or 'en_US' as the
> same locale, but I might be wrong. I also didn't find a way to figure out how
> to ask ICU if the locale identifier passed is complete garbage or not. One
> sure thing is that the system collation we import are of the form 'en-us', so
> it seems weird to have this form in pg_collation and by default another form in
> pg_database.

Yeah it seems to be inconsistent about that. The locale ID
documentation appears to indicate that "en_US" is the canonical form,
but when you ask it to list all the locales it knows about it returns
"en-US".

> In CREATE DATABASE manual:
>
> + Specifies the provider to use for the default collation in this
> + database. Possible values are:
> + <literal>icu</literal>,<indexterm><primary>ICU</primary></indexterm>
> + <literal>libc</literal>. <literal>libc</literal> is the default. The
> + available choices depend on the operating system and build options.
>
> That's actually not true as pg_strcasecmp is used in createdb():
>
> + if (pg_strcasecmp(locproviderstr, "icu") == 0)
> + dblocprovider = COLLPROVIDER_ICU;
> + else if (pg_strcasecmp(locproviderstr, "libc") == 0)
> + dblocprovider = COLLPROVIDER_LIBC;
> + else
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
> + errmsg("unrecognized locale provider: %s",
> + locproviderstr)));

I don't understand what the concern is here.

> Unless I'm missing something you entirely removed the warninng about the
> performance penalty when using non C/POSIX locale?

Yeah, I think it is not the job of the initdb man page to lecture people
about the merits of their locale choice. The same performance warning
can also be found in the localization chapter; we don't need to repeat
it everywhere a locale choice is mentioned.

> initdb has some specific processing to transform the default libc locale to
> something more appropriate, but as far as I can see creatdb / CREATE DATABASE
> aren't doing that. It seems inconsistent, and IMHO another reason why
> defaulting to the libc locale looks like a bad idea.

This has all been removed. The separate ICU locale option should now be
required everywhere (initdb, createdb, CREATE DATABASE).

> While on that topic, the doc should probably mention that default ICU
> collations can only be deterministic.

Well, there is no option to do otherwise, so I'm not sure where/how to
mention that. We usually don't document options that don't exist. ;-)

> Also unrelated, but how about raising a warning about possibly hiding
> corruption if you use CREATE COLLATION ... VERSION or CREATE DATABASE ...
> COLLATION_VERSION if !IsBinaryUpgrade()?

Hmm, there is a point to that. But then we should just make that an
error. Otherwise, we raise the warning but then there is no way to
"fix" what was warned about. Seems confusing.

> in AlterCollation(), pg_collation_actual_version(), AlterDatabaseRefreshColl()
> and pg_database_collation_actual_version():
>
> - datum = SysCacheGetAttr(COLLOID, tup, Anum_pg_collation_collcollate, &isnull);
> - Assert(!isnull);
> - newversion = get_collation_actual_version(collForm->collprovider, TextDatumGetCString(datum));
> + datum = SysCacheGetAttr(COLLOID, tup, collForm->collprovider == COLLPROVIDER_ICU ? Anum_pg_collation_colliculocale : Anum_pg_collation_collcollate, &isnull);
> + if (!isnull)
> + newversion = get_collation_actual_version(collForm->collprovider, TextDatumGetCString(datum));
> + else
> + newversion = NULL;
>
> The columns are now nullable, but can you actually end up with a null locale
> name in the expected field without manual DML on the catalog, corruption or
> similar? I think it should be a plain error explaining the inconsistency
> rather then silently assuming there's no version. Note that at least
> pg_newlocale_from_collation() asserts that the specific libc/icu field it's
> interested in isn't null.

This is required because the default collations's fields are null now.

> Why not using errorConflictingDefElem for locale_provider?

fixed

> in createdb():
>
> + if (dblocprovider == COLLPROVIDER_ICU)
> + {
> + /*
> + * This would happen if template0 uses the libc provider but the new
> + * database uses icu.
> + */
> + if (!dbiculocale)
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("ICU locale must be specified")));
> + }
> +
> + if (dblocprovider == COLLPROVIDER_ICU)
> + {
> +#ifdef USE_ICU
> [...]
>
> Seems like a refactoring leftover, no need for two blocks.

These are two independent blocks in my mind. It's possible that someone
might want to insert something in the middle at some point.

> Also, I think it
> would be better to first error out if there's no support for ICU rather than
> complaining about a possibly missing locale that wouldn't be accepted anyway.

Seems better to do general syntax checks first and then deeper checks of
the passed values.

> +void
> +make_icu_collator(const char *iculocstr,
> + struct pg_locale_struct *resultp)
> +{
> +#ifdef USE_ICU
> +[...]
> + /* We will leak this string if we get an error below :-( */
> + resultp->info.icu.locale = MemoryContextStrdup(TopMemoryContext, iculocstr);
> + resultp->info.icu.ucol = collator;
> +#else /* not USE_ICU */
> + /* could get here if a collation was created by a build with ICU */
> + ereport(ERROR,
> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> + errmsg("ICU is not supported in this build"), \
> + errhint("You need to rebuild PostgreSQL using %s.", "--with-icu")));
> +#endif /* not USE_ICU */
> +}
>
> The comment about the leak possibility needs some tweaking since it's been
> extracted from a larger function.

fixed

> Not material for this patch, but:
> @@ -1475,7 +1524,12 @@ pg_newlocale_from_collation(Oid collid)
> Assert(OidIsValid(collid));
>
> if (collid == DEFAULT_COLLATION_OID)
> - return (pg_locale_t) 0;
> + {
> + if (default_locale.provider == COLLPROVIDER_ICU)
> + return &default_locale;
> + else
> + return (pg_locale_t) 0;
> + }
>
> I'm wondering if we could now always return &default_locale and avoid having to
> check if the function returned something in all callers, since CheckMyDatabase
> now initialize it.

Maybe that's something to look into, but that would probably require
updating a call callers to handle the return values differently, which
would require quite a bit of work.

> Shouldn't all the code before checking the locale name also be in the #ifdef
> USE_ICU?

I like to keep the #ifdef sections as small as possible and limited to
the code that really uses the respective library.

> @@ -2859,6 +2870,17 @@ dumpDatabase(Archive *fout)
> appendPQExpBufferStr(creaQry, " ENCODING = ");
> appendStringLiteralAH(creaQry, encoding, fout);
> }
> + if (strlen(datlocprovider) > 0)
> + {
> + appendPQExpBufferStr(creaQry, " LOCALE_PROVIDER = ");
> + if (datlocprovider[0] == 'c')
> + appendPQExpBufferStr(creaQry, "libc");
> + else if (datlocprovider[0] == 'i')
> + appendPQExpBufferStr(creaQry, "icu");
> + else
> + fatal("unrecognized locale provider: %s",
> + datlocprovider);
> + }
> if (strlen(collate) > 0 && strcmp(collate, ctype) == 0)
> {
>
> AFAICS datlocprovider shouldn't be empty. Maybe raise an error or an assert if
> that's the case?

Yeah that was bogus, copied from the earlier encoding handling, which is
probably also bogus, but I'm not touching that here.

> # CREATE DATABASE db1 LOCALE_PROVIDER icu ICU_LOCALE "fr-x-icu";
> ERROR: 22023: new locale provider (i) does not match locale provider of the template database (c)
> HINT: Use the same locale provider as in the template database, or use template0 as template.
>
> It feels strange to write "LOCALE_PROVIDER icu" and get "provider (i)" in the
> error message. I think it would be better to emit the user-facing value, not
> internal one.

Fixed. I had added a collprovider_name() function but didn't use it here.

> Finally, there are some changes in pg_collation, I'm assuming it's done for
> consistency since pg_database may need two different locale information, but it
> should probably be at least mentioned in the commit message.

done

Attachment Content-Type Size
v6-0001-Add-option-to-use-ICU-as-global-locale-provider.patch text/plain 76.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2022-03-14 13:18:42 Re: On login trigger: take three
Previous Message Amit Kapila 2022-03-14 12:47:10 Re: Column Filtering in Logical Replication