Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?
Date: 2017-09-20 02:01:47
Message-ID: CAH2-WzkyJCJNzarYYj0HTt0NTUWpKBuEmnUoX8QDA6+XRFE71Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Sep 19, 2017 at 5:52 PM, Peter Eisentraut
<peter(dot)eisentraut(at)2ndquadrant(dot)com> wrote:
> On 9/18/17 18:46, Peter Geoghegan wrote:
>> As I pointed out a couple of times already [1], we don't currently
>> sanitize ICU's BCP 47 language tags within CREATE COLLATION.
>
> There is no requirement that the locale strings for ICU need to be BCP
> 47. ICU locale names like 'de(at)collation=phonebook' are also acceptable.

Right. But, we only document that BCP 47 is supported by Postgres.
Maybe we can use get_icu_language_tag()/uloc_toLanguageTag() to ensure
that we end up with BCP 47, even when the user happens to specify the
legacy syntax. Should we be "canonicalizing" legacy ICU locale strings
as BCP 47, too?

> The reason they are not validated is that, as you know, ICU accepts any
> locale string as valid. You appear to have found a way to do some
> validation, but I would like to see that code.

As I mentioned, I'm simply calling
get_icu_language_tag()/uloc_toLanguageTag() to do that sanitization.
The code to get the extra sanitization is completely trivial.

I didn't post the patch that generates the errors in my opening e-mail
because I'm not confident it's correct just yet. And, I think that I
see a bigger problem: we pass a string that is almost certainly a BCP
47 string to ucol_open() from within pg_newlocale_from_collation(). We
do so despite the fact that ucol_open() apparently doesn't accept BCP
47 syntax locale strings until ICU 54 [1]. Seems entirely possible
that this accounts for the problems you saw on ICU 4.2, back when we
were still creating keyword variants (I guess that the keyword
variants seem very "BCP 47-ish" to me).

I really think we need to add some kind of debug mode that makes ICU
optionally spit out a locale display name at key points. We need this
to gain confidence that the behavior that ICU provides actually
matches what is expected across ICU different versions for different
locale formats. We talked about this as a user-facing feature before,
which can wait till v11; I just want this to debug problems like this
one.

[1] https://ssl.icu-project.org/apiref/icu4c/ucol_8h.html#a4721e4c0a519bb0139a874e191223590
--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2017-09-20 02:01:54 Re: Boom filters for hash joins (was: A design for amcheck heapam verification)
Previous Message Andres Freund 2017-09-20 02:00:38 Re: pgsql: Make new crash restart test a bit more robust.