Re: ICU for global collation

From: Julien Rouhaud <rjuju123(at)gmail(dot)com>
To: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Verite <daniel(at)manitou-mail(dot)org>
Subject: Re: ICU for global collation
Date: 2022-03-05 08:38:30
Message-ID: 20220305083830.lpz3k3yku5lmm5xs@jrouhaud
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Wed, Feb 16, 2022 at 03:25:40PM +0100, Peter Eisentraut wrote:
>
> All that preliminary work has been completed, so here is a new patch.
>
> There isn't actually much left here now except all the new DDL and
> command-line options to set this up and documentation for those. I have
> given all that another review and I hope it's more intuitive now, but I
> guess there will be other opinions.

Sorry it took me a bit of time to come back to this patch.

TL;DR it works as expected. I just have a few comments, mostly on the doc.

I say it works because I did manually check, as far as I can see there isn't
any test that ensures it.

I'm using this naive scenario:

DROP DATABASE IF EXISTS dbicu;
CREATE DATABASE dbicu LOCALE_PROVIDER icu LOCALE 'en_US' ICU_LOCALE 'en-u-kf-upper' template 'template0';
\c dbicu
CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper');
CREATE TABLE icu(def text, en text COLLATE "en_US", upfirst text COLLATE upperfirst);
INSERT INTO icu VALUES ('a', 'a', 'a'), ('b','b','b'), ('A','A','A'), ('B','B','B');
SELECT def AS def FROM icu ORDER BY def;
SELECT def AS en FROM icu ORDER BY en;
SELECT def AS upfirst FROM icu ORDER BY upfirst;
SELECT def AS upfirst_explicit FROM icu ORDER BY en COLLATE upperfirst;
SELECT def AS en_x_explicit FROM icu ORDER BY def COLLATE "en-x-icu";

Maybe there should be some test along those lines included in the patch?

> I have changed the terminology a bit to match ICU better. It's now called
> "ICU locale ID" and "locale provider" (instead of "collation"). It might
> actually cover things that are not strictly collations (such as the isalpha
> stuff in text search, maybe, in the future).

I'm not sure that's actually such an improvement as-is. Simply saying "ICU
locale ID" is, at least for me, somewhat ambiguous as I don't see any place in
our docs where it's clearly defined. I'm afraid that users might confuse it
with the OID of a pg_collation line, or even a collation name (like en-x-icu),
especially since there's no simple way to know if what you used means what you
thought it meant.

Also, it's not even used consistently in the patch. I can see e.g. "ICU
locale" or "ICU locale setting" being used:

+ <varlistentry>
+ <term><replaceable class="parameter">icu_locale</replaceable></term>
+ <listitem>
+ <para>
+ Specifies the ICU locale if the ICU locale provider is used. If
+ this is not specified, the value from the <literal>LOCALE</literal>
+ option is used.
+ </para>
+ </listitem>
+ </varlistentry>

+ <term><option>--icu-locale=<replaceable class="parameter">locale</replaceable></option></term>
+ <listitem>
+ <para>
+ Specifies the ICU locale setting to be used in this database, if the
+ ICU locale provider is selected.
+ </para>

Maybe we could point to the ICU documentation for a clear notion of what a
locale ID is, and/or use their own short definition [1]:

> The locale object in ICU is an identifier that specifies a particular locale
> and has fields for language, country, and an optional code to specify further
> variants or subdivisions. These fields also can be represented as a string
> with the fields separated by an underscore.

It seems critical to be crystal clear about what should be specified there.

I spent some time looking at the ICU api trying to figure out if using a
posix locale name (e.g. en_US) was actually compatible with an ICU locale name.
It seems that ICU accept any of 'en-us', 'en-US', 'en_us' or 'en_US' as the
same locale, but I might be wrong. I also didn't find a way to figure out how
to ask ICU if the locale identifier passed is complete garbage or not. One
sure thing is that the system collation we import are of the form 'en-us', so
it seems weird to have this form in pg_collation and by default another form in
pg_database.

All in all I'm a bit worried of having the icu_locale optional. Note that this
is inconsistent with createdb(), as there's at least one code path where the
icu_locale is not optional:

+ if (dblocprovider == COLLPROVIDER_ICU)
+ {
+ /*
+ * This would happen if template0 uses the libc provider but the new
+ * database uses icu.
+ */
+ if (!dbiculocale)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("ICU locale must be specified")));
+ }

>
> One thing that is left that bothers me is that the invocations of
> get_collation_actual_version() have all gotten quite complicated. I'm
> thinking about ways to refactor that, but I haven't got a great idea yet.

Indeed, and I don't have a great idea either. Maybe
get_collation_actual_version_extended that would accept both strings?

In CREATE DATABASE manual:

+ Specifies the provider to use for the default collation in this
+ database. Possible values are:
+ <literal>icu</literal>,<indexterm><primary>ICU</primary></indexterm>
+ <literal>libc</literal>. <literal>libc</literal> is the default. The
+ available choices depend on the operating system and build options.

That's actually not true as pg_strcasecmp is used in createdb():

+ if (pg_strcasecmp(locproviderstr, "icu") == 0)
+ dblocprovider = COLLPROVIDER_ICU;
+ else if (pg_strcasecmp(locproviderstr, "libc") == 0)
+ dblocprovider = COLLPROVIDER_LIBC;
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("unrecognized locale provider: %s",
+ locproviderstr)));

By extension that's the same for createdb, but createdb manual doesn't
explicitly mention that case is important so I guess that's ok.

- To alter the default collation order or character set classes, use the
- <option>--lc-collate</option> and <option>--lc-ctype</option> options.
- Collation orders other than <literal>C</literal> or <literal>POSIX</literal> also have
- a performance penalty. For these reasons it is important to choose the
- right locale when running <command>initdb</command>.
+ To choose a different locale for the cluster, use the option
+ <option>--locale</option>. There are also individual options
+ <option>--lc-*</option> (see below) to set values for the individual locale
+ categories. Note that inconsistent settings for different locale
+ categories can give nonsensical results, so this should be used with care.

Unless I'm missing something you entirely removed the warninng about the
performance penalty when using non C/POSIX locale?

+ To chose the specific ICU locale ID to apply, use the option
+ <option>--icu-locale</option>. The ICU locale ID defaults to
+ <option>--locale</option> or the environment, as above (with some name
+ mangling applied to make the locale naming appropriate for ICU). Note that
+ for implementation reasons and to support legacy code,
+ <command>initdb</command> will still select and initialize libc locale
+ settings when the ICU locale provider is used.

initdb has some specific processing to transform the default libc locale to
something more appropriate, but as far as I can see creatdb / CREATE DATABASE
aren't doing that. It seems inconsistent, and IMHO another reason why
defaulting to the libc locale looks like a bad idea.

While on that topic, the doc should probably mention that default ICU
collations can only be deterministic.

@@ -168,18 +175,6 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
errmsg("collation \"default\" cannot be copied")));
}

- if (localeEl)
- {
- collcollate = defGetString(localeEl);
- collctype = defGetString(localeEl);
- }
[...]

I tried to read the function and quickly got confused about whether possible
problematic conditions could be reached or not and had protection or not. I
think that DefineCollation is becoming more and more unreadable, with a mix of
fromEl, !fromEl and either. Maybe it would be a good time to refactor the code
a bit and have have something like

if (fromEl)
{
// initialize all specific things
}
else
{
// initialize all specific things
}

// handle defaults and sanity checks

What do you think?

Also unrelated, but how about raising a warning about possibly hiding
corruption if you use CREATE COLLATION ... VERSION or CREATE DATABASE ...
COLLATION_VERSION if !IsBinaryUpgrade()?

in AlterCollation(), pg_collation_actual_version(), AlterDatabaseRefreshColl()
and pg_database_collation_actual_version():

- datum = SysCacheGetAttr(COLLOID, tup, Anum_pg_collation_collcollate, &isnull);
- Assert(!isnull);
- newversion = get_collation_actual_version(collForm->collprovider, TextDatumGetCString(datum));
+ datum = SysCacheGetAttr(COLLOID, tup, collForm->collprovider == COLLPROVIDER_ICU ? Anum_pg_collation_colliculocale : Anum_pg_collation_collcollate, &isnull);
+ if (!isnull)
+ newversion = get_collation_actual_version(collForm->collprovider, TextDatumGetCString(datum));
+ else
+ newversion = NULL;

The columns are now nullable, but can you actually end up with a null locale
name in the expected field without manual DML on the catalog, corruption or
similar? I think it should be a plain error explaining the inconsistency
rather then silently assuming there's no version. Note that at least
pg_newlocale_from_collation() asserts that the specific libc/icu field it's
interested in isn't null.

+ else if (strcmp(defel->defname, "icu_locale") == 0)
+ {
+ if (diculocale)
+ errorConflictingDefElem(defel, pstate);
+ diculocale = defel;
+ }
+ else if (strcmp(defel->defname, "locale_provider") == 0)
+ {
+ if (dlocprovider)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("conflicting or redundant options"),
+ parser_errposition(pstate, defel->location)));
+ dlocprovider = defel;
+ }

Why not using errorConflictingDefElem for locale_provider?

in createdb():

+ if (dblocprovider == COLLPROVIDER_ICU)
+ {
+ /*
+ * This would happen if template0 uses the libc provider but the new
+ * database uses icu.
+ */
+ if (!dbiculocale)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("ICU locale must be specified")));
+ }
+
+ if (dblocprovider == COLLPROVIDER_ICU)
+ {
+#ifdef USE_ICU
[...]

Seems like a refactoring leftover, no need for two blocks. Also, I think it
would be better to first error out if there's no support for ICU rather than
complaining about a possibly missing locale that wouldn't be accepted anyway.

+void
+make_icu_collator(const char *iculocstr,
+ struct pg_locale_struct *resultp)
+{
+#ifdef USE_ICU
+[...]
+ /* We will leak this string if we get an error below :-( */
+ resultp->info.icu.locale = MemoryContextStrdup(TopMemoryContext, iculocstr);
+ resultp->info.icu.ucol = collator;
+#else /* not USE_ICU */
+ /* could get here if a collation was created by a build with ICU */
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("ICU is not supported in this build"), \
+ errhint("You need to rebuild PostgreSQL using %s.", "--with-icu")));
+#endif /* not USE_ICU */
+}

The comment about the leak possibility needs some tweaking since it's been
extracted from a larger function.

Not material for this patch, but:
@@ -1475,7 +1524,12 @@ pg_newlocale_from_collation(Oid collid)
Assert(OidIsValid(collid));

if (collid == DEFAULT_COLLATION_OID)
- return (pg_locale_t) 0;
+ {
+ if (default_locale.provider == COLLPROVIDER_ICU)
+ return &default_locale;
+ else
+ return (pg_locale_t) 0;
+ }

I'm wondering if we could now always return &default_locale and avoid having to
check if the function returned something in all callers, since CheckMyDatabase
now initialize it.

@@ -2184,6 +2195,50 @@ setlocales(void)
check_locale_name(LC_CTYPE, lc_messages, &canonname);
lc_messages = canonname;
#endif
+
+ if (locale_provider == COLLPROVIDER_ICU)
+ {
+ if (!icu_locale && locale)
+ icu_locale = locale;
+
+ /*
+ * If ICU is selected but no ICU locale has been given, take the
[...]
+ /*
+ * Check ICU locale name
+ */
+#ifdef USE_ICU
+ {
+ UErrorCode status;
+
+ status = U_ZERO_ERROR;
+ ucol_open(icu_locale, &status);
+ if (U_FAILURE(status))
+ {
+ pg_log_error("could not open collator for locale \"%s\": %s",
+ icu_locale, u_errorName(status));
+ exit(1);
+ }
+ }
+#else
+ pg_log_error("ICU is not supported in this build");
+ fprintf(stderr, _("You need to rebuild PostgreSQL using %s.\n"), "--with-icu");
+ exit(1);
+#endif
+ }

Shouldn't all the code before checking the locale name also be in the #ifdef
USE_ICU? Also, the comment should be consistent with the doc and mention
ICU locale ID.

@@ -2859,6 +2870,17 @@ dumpDatabase(Archive *fout)
appendPQExpBufferStr(creaQry, " ENCODING = ");
appendStringLiteralAH(creaQry, encoding, fout);
}
+ if (strlen(datlocprovider) > 0)
+ {
+ appendPQExpBufferStr(creaQry, " LOCALE_PROVIDER = ");
+ if (datlocprovider[0] == 'c')
+ appendPQExpBufferStr(creaQry, "libc");
+ else if (datlocprovider[0] == 'i')
+ appendPQExpBufferStr(creaQry, "icu");
+ else
+ fatal("unrecognized locale provider: %s",
+ datlocprovider);
+ }
if (strlen(collate) > 0 && strcmp(collate, ctype) == 0)
{

AFAICS datlocprovider shouldn't be empty. Maybe raise an error or an assert if
that's the case?

# CREATE DATABASE db1 LOCALE_PROVIDER icu ICU_LOCALE "fr-x-icu";
ERROR: 22023: new locale provider (i) does not match locale provider of the template database (c)
HINT: Use the same locale provider as in the template database, or use template0 as template.

It feels strange to write "LOCALE_PROVIDER icu" and get "provider (i)" in the
error message. I think it would be better to emit the user-facing value, not
internal one.

Finally, there are some changes in pg_collation, I'm assuming it's done for
consistency since pg_database may need two different locale information, but it
should probably be at least mentioned in the commit message.

[1]: https://unicode-org.github.io/icu/userguide/locale/#:~:text=The%20locale%20object%20in%20ICU,fields%20separated%20by%20an%20underscore.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bharath Rupireddy 2022-03-05 08:44:54 Re: Allow async standbys wait for sync replication
Previous Message Tomas Vondra 2022-03-05 07:13:00 Re: ltree_gist indexes broken after pg_upgrade from 12 to 13