Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: Andreas Karlsson <andreas(at)proxel(dot)se>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?
Date: 2017-10-02 17:27:21
Message-ID: CAH2-Wz=Tg712z98j-GmHxWyZ0+fs+CT7t-RKcCcT8QPGK9e+Ew@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Oct 2, 2017 at 9:42 AM, Peter Eisentraut
<peter(dot)eisentraut(at)2ndquadrant(dot)com> wrote:
> I understand where you are coming from. My experience with developing
> this feature has been that it is quite fragile in some ways. We have
> had this out there for testing for many months, and we have fixed many
> bugs, and follow-up bugs, up to very recently. We don't have good
> automatic test coverage, so we need to rely on user testing. Making
> significant code and design changes a week or two before the final
> release is just too late, even if the changes were undisputed. And this
> wasn't just one specific isolated change, it was a set of overlapping
> changes that seemingly kept changing and expanding.

For the record, I don't accept that summary. We could have not
bothered with the sanitization thing at all, and I wouldn't have cared
very much. I even revised my proposal such that you would never have
needed to do the language tag mapping at all [1] (albeit while
desupporting ICU versions prior to 4.6). And, as recently as 7 days
ago, you yourself said:

"Reading over this code again, it is admittedly not quite clear why
this "canonicalization" is in there right now. I think it had
something to do with how we built the keyword variants at one point.
It might not make sense. I'd be glad to take that out and use the
result straight from uloc_getAvailable() for collcollate." [2]

This would have been far more radical than what I proposed.

> I'm satisfied that what we are shipping is "good enough". It has been
> in development for over a year, it has been reviewed and tested for
> months, and a lot of credit for that goes to you.

It is "good enough", I suppose. I am disappointed by this particular
outcome, but that's mostly because it could have been avoided. I'm
still very happy that you took on this project, and I don't want that
to be overshadowed by this particular disagreement.

> I'm available to discuss the changes you are envisioning, preferably one
> at a time, in the near future as part of the PG11 development process.

I don't really think that there is anything more to be done about the
issues raised on this thread, with the possible exception of the
DEBUG1 display name string thing. The sanitization just isn't valuable
enough to add after the fact. Apart from the risks of adding it after
the fact, another reason not to bother with it is that it's
*incredibly* forgiving. So forgiving that you could reasonably argue
that we might as well not have it at all.

I may well do more on ICU with you in the future, but not in the area
of how things are canonicalized or sanitized. It's too late for that
now.

[1] https://www.postgresql.org/message-id/CAH2-WzmVtRyNg2gT=u=ktEC-jM3aKq4bYzJ0u2=OsxE+O3KQ4g@mail.gmail.com
[2] https://www.postgresql.org/message-id/f6c0fca7-e277-3f46-c0c1-adc001bffdd7@2ndquadrant.com
--
Peter Geoghegan

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2017-10-02 17:55:52 Re: list of credits for release notes
Previous Message Pavel Stehule 2017-10-02 17:18:22 Re: issue: record or row variable cannot be part of multiple-item INTO list