Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?

From: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>, Andreas Karlsson <andreas(at)proxel(dot)se>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?
Date: 2017-09-22 18:34:45
Message-ID: 4e66ffa4-97db-d656-5b2a-12f9e8f50d91@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 9/21/17 16:55, Peter Geoghegan wrote:
>> I strongly prefer if there, as much as possible, is only one format for
>> inputting ICU locales.
> I agree, but the bigger issue is that we're *half way* between
> supporting only one format, and supporting two formats. AFAICT, there
> is no reason that we can't simply support one format on all ICU
> versions, and keep what ends up within pg_collation at initdb time
> essentially the same across ICU versions (except for those that are
> due to cultural/political developments).

After reviewing this thread and testing around a bit, I think the code
is mostly fine as it is, but the documentation is lacking. Hence,
attached is a patch to expand the documentation quite a bit, especially
to document in more detail what ICU locale strings are accepted.

The documentation has always stated, albeit perhaps in obscure ways,
that we accept for locales whatever ICU accepts. I don't think we
should restrict or override that in any way. That would cause existing
documentation and lore on ICU to be invalid for PostgreSQL.

I specifically disagree that we should, as appears to be suggested here,
restrict the input locale strings to BCP 47 and reject or transform the
traditional ICU-specific locale syntax. Again, that would cause
existing ICU documentation material to become less applicable to
PostgreSQL. And basically, there is no reason for it, because I am not
aware that ICU plans to get rid of that syntax. Moreover, we need to
support that syntax for older ICU versions anyway. A patch has been
posted that, as I understand it, would allow BCP 47 syntax to be used
with older versions as well. That's a nice idea, but it's a new feature
and not appropriate for PG10 at this time.

(Note that we also don't canonicalize libc locale names.)

The attached documentation patch documents both locale naming forms in
parallel.

The other attached patch contains a test suite that verifies that the
examples in the documentation actually work. It's not meant for
committing into the mainline, but it was useful for me.

During testing with various versions I have also discovered the
following things:

- The current code appears to be of the opinion that BCP 47 locale names
are accepted as of ICU 54. That appears to be incorrect; I find that
they work fine in ICU 52, but they don't work in ICU 4.2. I don't know
where the cutoff is. That might be worth changing in the code if we can
obtain more information.

- What might have led to the above mistake is the following in the
ucol_open() documentation: q{Starting with ICU 54, collation attributes
can be specified via locale keywords as well, in the old locale
extension syntax ("el(at)colCaseFirst=upper") or in language tag syntax
("el-u-kf-upper").} That is correct. If you use that syntax in earlier
versions, the case-first specification in this example is ignored. You
need to use ucol_setAttribute() then. (Note that the phonebook stuff
still works, because that is not a "collation attribute".)

(One of my plans for PG11 had been to allow specifying such collation
attributes via additional CREATE COLLATION clauses and pg_collation
columns, but that might be obsolete now.)

- Moreover, it is not the case that ICU accepts just any sort of
nonsense as a locale name. For example, "el(at)colCaseFirst=whatever" is
rejected with U_ILLEGAL_ARGUMENT_ERROR. Now, it might in other cases be
more liberal than we might be hoping for. But I think they have reasons
for what they do.

One conclusion here is that there are multiple dimensions to what sort
of thing is accepted as a locale in different versions of ICU and what
the canonical format is. If we insist that everything has to be written
in the form that is preferred today, then we'll have awkward problems if
a future version of ICU establishes even more variants that will then be
preferred.

I think there is room for incremental refinement here. Features like
optionally checking or printing the canonical or preferred locale format
or making the locale description available via a function or system view
might all be valuable additions to a future PostgreSQL release.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
0001-Expand-collation-documentation.patch text/plain 13.1 KB
0002-Test-cases-for-advanced-ICU-features.patch text/plain 6.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2017-09-22 18:36:31 Re: "inconsistent page found" with checksum and wal_consistency_checking enabled
Previous Message Robert Haas 2017-09-22 18:26:14 Re: Page Scan Mode in Hash Index