Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?

From: Noah Misch <noah(at)leadboat(dot)com>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?
Date: 2017-09-25 08:26:21
Message-ID: 20170925082621.GB37725@rfd.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Sep 19, 2017 at 07:01:47PM -0700, Peter Geoghegan wrote:
> On Tue, Sep 19, 2017 at 5:52 PM, Peter Eisentraut
> <peter(dot)eisentraut(at)2ndquadrant(dot)com> wrote:
> > On 9/18/17 18:46, Peter Geoghegan wrote:
> >> As I pointed out a couple of times already [1], we don't currently
> >> sanitize ICU's BCP 47 language tags within CREATE COLLATION.
> >
> > There is no requirement that the locale strings for ICU need to be BCP
> > 47. ICU locale names like 'de(at)collation=phonebook' are also acceptable.
>
> Right. But, we only document that BCP 47 is supported by Postgres.
> Maybe we can use get_icu_language_tag()/uloc_toLanguageTag() to ensure
> that we end up with BCP 47, even when the user happens to specify the
> legacy syntax. Should we be "canonicalizing" legacy ICU locale strings
> as BCP 47, too?
>
> > The reason they are not validated is that, as you know, ICU accepts any
> > locale string as valid. You appear to have found a way to do some
> > validation, but I would like to see that code.
>
> As I mentioned, I'm simply calling
> get_icu_language_tag()/uloc_toLanguageTag() to do that sanitization.
> The code to get the extra sanitization is completely trivial.
>
> I didn't post the patch that generates the errors in my opening e-mail
> because I'm not confident it's correct just yet. And, I think that I
> see a bigger problem: we pass a string that is almost certainly a BCP
> 47 string to ucol_open() from within pg_newlocale_from_collation(). We
> do so despite the fact that ucol_open() apparently doesn't accept BCP
> 47 syntax locale strings until ICU 54 [1]. Seems entirely possible
> that this accounts for the problems you saw on ICU 4.2, back when we
> were still creating keyword variants (I guess that the keyword
> variants seem very "BCP 47-ish" to me).
>
> I really think we need to add some kind of debug mode that makes ICU
> optionally spit out a locale display name at key points. We need this
> to gain confidence that the behavior that ICU provides actually
> matches what is expected across ICU different versions for different
> locale formats. We talked about this as a user-facing feature before,
> which can wait till v11; I just want this to debug problems like this
> one.
>
> [1] https://ssl.icu-project.org/apiref/icu4c/ucol_8h.html#a4721e4c0a519bb0139a874e191223590

[Action required within three days. This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item. Peter,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10. Consequently, I will appreciate your efforts
toward speedy resolution. Thanks.

[1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2017-09-25 08:57:21 Re: SERIALIZABLE with parallel query
Previous Message Amit Langote 2017-09-25 07:42:22 Re: Shaky coding for vacuuming partitioned relations