Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?
Date: 2017-09-25 16:36:44
Message-ID: CAH2-Wzka1nJX2pY_1k+bUM0WN-qEgyDHj2h9UdO0B8PLdgmeeQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Sep 25, 2017 at 9:06 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> The big concern I have here is that this feels a lot like something that
>> we'll regret at leisure, if it's not right in the first release. I'd
>> much rather be restrictive in v10 and then loosen the rules later, than
>> be lax in v10 and then have to argue about whether to break backwards
>> compatibility in order to gain saner behavior.
>
> I think it's inevitable that a certain number of users are going to
> have to cope with ICU version changes breaking stuff.

Wasn't the main point of adopting ICU that that doesn't happen when it
isn't essential? There will be some risk with pg_dump across ICU
versions, due rare to political changes. But, on pg_upgrade, the old
collations will continue to work, even if they would not have appeared
at initdb time on that ICU version, because ICU has all kinds of
fallbacks. It will work with a language it has heard of, but a country
it has never heard of, for example. I just want to have a consistent
collcollate format, to consistently take advantage of that.

My patch has no behavioral changes, apart from the sanitization, on
ICU >= 54, BTW. It's really not very much code, especially when you
exclude objective bug fixes.

> If ICU decides
> a collation is stupid or unused and drops it, or is mis-defined and
> redefines it to some behavior that breaks things for somebody, they
> are going to have to deal with it. I don't think you can make that
> problem go away by any amount of strictness introduced into v10, but
> if you make the checks zealous enough, you can probably make them rule
> out input that users would have preferred to have accepted.

The sanitization thing isn't my main concern. The stability issue is
the major issue, and it isn't all that connected to sanitization.

> I also think that if there's a compelling reason to bet on BCP 47 to
> be a stable canonical form, I haven't heard it presented here. At the
> risk of repeating myself, it's not even supported in some ICU versions
> we support, so how's that going to work? And if it's been changed in
> the recent past, why not again? Peter Geoghegan said that he doesn't
> know of any plans to eliminate BCP 47 support, but that doesn't seem
> like it's much proof of anything.

It's described by an IETF standard, and the extensions are described
by Unicode Technical Standard #35, a formal standard [1]. The BCP 47
format is not an ICU thing at all -- it's a Unicode consortium thing.
An express goal of BCP 47 is forward and backward compatibility [2].
And, ICU themselves have said that that's what they're standardizing
on, since they are upstream consumers of the CLDR (Unicode consortium)
locale data. They've done that because they want to move away from
their own format towards a universal format, usable in a wide variety
of contexts.

While I don't pretend to be able to predict the future, I think that
this is something that is as safe as is practically achievable to
standardize on. (It also has all kinds of provision for further
extension, should that prove necessary.)

> If, on the other hand, you introduce an error check that is overly
> stringent and precludes people from defining collations that are legal
> and useful (in their judgement, not yours), I intend to whine about
> that extensively. And that applies to 10, 11, and any future versions
> for which I may be around.

That's not going to happen.

> In short, I judge that allowing users access to *all* of the things
> that ICU has now, has had in the past in versions we support, or may
> have in the future is an important goal, but that preventing them from
> relying on options that may go away is not a goal at all, since
> barring the ability to predict the future, it's impossible anyway.

+1

> If it's possible to prevent to write a precisely-targeted check that
> rules out only actually-meaningless collations and nothing else, I'm
> fine with that.

That's what I've done with my patch, IMV. I admit that it's still a
tiny bit subjective that we should reject an empty string, and not
allow the fallback there, for example (we could instead fall back on
that as "undefined"). I think that that's not going to be a problem
for anyone.

[1] http://unicode.org/reports/tr35/tr35-collation.html
[2] https://tools.ietf.org/html/bcp47#section-3.4
--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Petr Jelinek 2017-09-25 16:39:47 Re: Built-in plugin for logical decoding output
Previous Message Pavel Stehule 2017-09-25 16:35:19 Re: logical replication and statistics