Re: ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Daniel Verite <daniel(at)manitou-mail(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)
Date: 2017-08-07 22:21:18
Message-ID: CAH2-Wz=pA+ViKfPxGyBvyc41H4FhdHp=HUrmK9CDfnYdziXziQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Aug 7, 2017 at 2:50 PM, Peter Eisentraut
<peter(dot)eisentraut(at)2ndquadrant(dot)com> wrote:
> On 8/6/17 20:07, Peter Geoghegan wrote:
>> I've looked into this. I'll give an example of what keyword variants
>> there are for Greek, and then discuss what I think each is.
>
> I'm not sure why we want to get into editorializing this. We query ICU
> for the names of distinct collations and use that.

We ask ucol_getKeywordValuesForLocale() to get only "commonly used
[variant] values with the given locale" within
pg_import_system_collations(). So the editorializing has already
begun.

> It's more than most
> people need, sure, but it doesn't cost us anything.

It's also *less* than what other users need. I disagree on the cost of
redundancy among entries after initdb. It's just confusing to users,
and seems avoidable without adding special case logic. What's the
difference between el-u-co-standard-x-icu and el-x-icu?

> The alternatives
> are hand-maintaining a list of collations, or installing no collations
> by default.

A better alternative would be to actively take an interest in what
collations are created, by further refining the rules by which they
are created. We have a stable API, described by various standards,
that we can work with for this. This doesn't have to be a
maintainability burden. We can provide general guidance about how to
add stuff back within documentation.

I do think that we should actually list all the collations that are
available by default on some representative ICU version, once that
list is tightened up, just as other database systems list them. That
necessitates a little weasel wording that notes that later ICU
versions might add more, but that's not a problem IMV. I don't think
that CLDR will ever omit anything previously available, at least
within a reasonable timeframe [1].

[1] http://cldr.unicode.org/index/process/cldr-data-retention-policy
--
Peter Geoghegan

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2017-08-07 22:23:56 Re: ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)
Previous Message Tom Lane 2017-08-07 22:15:11 Re: max_files_per_processes vs others uses of file descriptors