Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?
Date: 2017-09-23 01:53:02
Message-ID: CAH2-WzkLuHXwsVjzL_9EM5ZGrXkPsKzuJtVrVCEM+uBihHwNug@mail.gmail.com
Lists: pgsql-hackers

On Fri, Sep 22, 2017 at 5:58 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Sep 22, 2017 at 4:46 PM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>> But you are *already* canonicalizing ICU collation names as BCP 47. My
>> point here is: Why not finish the job off, and *also* canonicalize
>> collcollate in the same way?
>
> Peter, with respect, it's time to let this argument go. We're
> scheduled to wrap a GA release in just over 72 hours. It is far too
> late to change behavior like this.

I didn't say that it wasn't. That's above my pay grade.

> On the substantive issue, I am inclined (admittedly without deep
> study) to agree with Peter Eisentraut. We have never canonicalized
> collations before and therefore it is not essential that we do that
> now.

As I've said, we're *already* canonicalizing them for ICU. Just not
consistently (across ICU versions, and arguably even within ICU
versions). That's the problem -- we're halfway between the two
positions.

The problem is most emphatically *not* that we've failed to
canonicalize them in the way that I happen to favor.

> That would be a new feature, and I don't think I'd be prepared
> to endorse adding it three days after feature freeze let alone three
> days before the GA wrap. I do agree that the lack of canonicalization
> is utterly terrible. The APIs that Unix-like operating systems
> provide for collations are poorly suited to our purposes and
> hopelessly squishy about semantics, and it's not clear how much better
> ICU will be.

In one important sense, this is a regression against libc, because you
never had something like en_US.UTF-8 break on downgrading the glibc
version (for example, when you restore a base backup on a different OS
with the same architecture). Sure, you probably had to REINDEX text
indexes, to be on the safe side, but once you did that there was no
risk of the older glibc never having heard of "en_US.UTF-8" as a
LC_COLLATE/collcollate value.

I regret that I didn't catch it sooner. It now seems very obvious, and
totally preventable given enough time.

> I simply do not buy the theory that this cannot be changed later.

It can be changed later, of course -- at greater, though indeterminate cost.

> It's been the case for as long as we've had pg_collation that a new
> system could have different collations than the old one, resulting in
> a dump/restore failure. I expect somebody's had that problem at some
> point, but I don't think it's become a major pain point because most
> people don't use exotic collations, and if they do they probably
> understand that they need those exotic collations to be on the new
> system too.

Like I said, you don't need exotic collations to have the downgrade
problem, unless *any* initdb ICU collation counts as exotic. No CREATE
COLLATION is needed.

> I also believe that Peter Eisentraut is entirely correct to be
> concerned about whether BCP 47 (or anything else) can really be
> regarded as a stable canonical form for ICU purposes. His email
> indicates that the acceptable and canonical forms have changed
> multiple times in the course of releases new enough for us to care
> about them. Assuming that statement is correct, it would be extremely
> short-sighted of us to bank on them not changing any more.

That statement isn't correct, and neither is the suggestion that Peter
Eisentraut ever made it. ICU uses BCP 47 for collation names *across
all versions*. It is just not used as the collcollate value everywhere
(that's only the case on versions of ICU >= 54).
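
To make the version split concrete, here is a minimal Python sketch. It is
not PostgreSQL source code: the function name, the tuple type, and the
example strings are illustrative assumptions, as is the idea that older ICU
versions keep the legacy ICU locale ID in collcollate. The only behavior
taken from the text above is that the collation name is a BCP 47 tag on
every ICU version, while collcollate is a BCP 47 tag only on ICU >= 54.

```python
from typing import NamedTuple


class CatalogEntry(NamedTuple):
    collname: str      # name users refer to
    collcollate: str   # string later handed back to ICU


def initdb_entry(icu_major: int, locale_id: str, bcp47_tag: str) -> CatalogEntry:
    """Hypothetical model of how one ICU collation might be recorded."""
    # Name: BCP 47 on every ICU version.
    # collcollate: BCP 47 only on ICU >= 54; older versions are assumed
    # here to keep the legacy ICU locale ID.
    collcollate = bcp47_tag if icu_major >= 54 else locale_id
    return CatalogEntry(collname=bcp47_tag, collcollate=collcollate)


new = initdb_entry(55, "de@collation=phonebook", "de-u-co-phonebk")
old = initdb_entry(52, "de@collation=phonebook", "de-u-co-phonebk")

print(new)
print(old)
# Compare the two entries field by field: the names agree, the
# collcollate values do not.
print(new.collname == old.collname, new.collcollate == old.collcollate)
```

Under these assumptions, a catalog written with the newer ICU carries a
collcollate string that the older ICU installation would never have produced
itself, which is the downgrade scenario described earlier in this message.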

> But even if all of the above argumentation is utterly and completely
> wrong, dredged up from the universe's deepest and most profound
> reserves of stupidity and destined for future entry into Webster's as
> the canonical example of cluelessness, we still shouldn't change it
> the weekend before the GA wraps.

That seems like a value judgement. I'm not going to tell you that
you're wrong. What I will say is that I think we've done poorly here.

--
Peter Geoghegan
