Re: What users can do with custom ICU collations in Postgres 10

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: What users can do with custom ICU collations in Postgres 10
Date: 2017-08-14 16:36:43
Message-ID: CAH2-WznaO+jA+rNmpHw9c3vXyKiiPSSpktSfOccChRp_98r1Tw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Aug 14, 2017 at 9:15 AM, Peter Eisentraut
<peter(dot)eisentraut(at)2ndquadrant(dot)com> wrote:
> I'm having trouble finding some concrete documentation for this. The TR
> 35 link you showed documents the key words and values, BCP 47 documents
> the syntax, but nothing puts it all together in a form consumable by
> users. The ICU documentation still mainly focuses on the "old"
> @keyword=value syntax. I guess we'll have to write our own for now.

There is an unusual style to the standards that apply here. They're
incredibly detailed, and the options are very powerful, but they're
written in an unfamiliar language. ICU just considers itself a consumer
of the CLDR locale stuff, which is a broad standard.

We don't have to write comprehensive documentation of these
kn/kb/ka/kh options that I pointed out exist. I think it would be nice
to cover a few interesting cases, and link to the BCP 47 Unicode
extension (TR 35) stuff.
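
As a rough illustration of how the kn/ka style options fit into a
locale name, here is a minimal Python sketch. The helper names
(icu_locale_tag, create_collation_sql) and the collation name
en_numeric are made up for this example, and the sketch assumes that
keywords go after a "-u-" marker, sorted by key. It only builds
strings; it does not check that ICU accepts the result.

```python
def icu_locale_tag(base, **keywords):
    """Build a BCP 47 tag with a Unicode ("-u-") extension.

    base is a language tag such as "en" or "und"; keywords maps
    two-letter keys (e.g. kn, ka) to their values. Hypothetical helper.
    """
    if not keywords:
        return base
    parts = [base, "u"]
    # Emit keywords sorted by key so the same options always give the same tag.
    for key in sorted(keywords):
        parts.append(key)
        parts.append(keywords[key])
    return "-".join(parts)


def create_collation_sql(name, tag):
    """Return a Postgres 10 style CREATE COLLATION statement as text."""
    return "CREATE COLLATION {} (provider = icu, locale = '{}');".format(name, tag)


# Numeric ordering ("kn"), so that digit runs sort by numeric value.
numeric_tag = icu_locale_tag("en", kn="true")
print(create_collation_sql("en_numeric", numeric_tag))

# Two options combined: "shifted" alternate handling plus numeric ordering.
print(icu_locale_tag("und", ka="shifted", kn="true"))
```

The generated statement can be pasted into psql on a server built with
ICU support. Whether a given key or value is honored depends on the
ICU version in use.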

Here is a list of scripts, all of which are reorderable with this TR 35
stuff (the exact set varies somewhat based on CLDR/ICU version):

http://unicode.org/iso15924/iso15924-codes.html

Here is a CLDR-specific XML specification of the variant keywords
(which can easily be mapped to a specific ICU version):

http://www.unicode.org/repos/cldr/tags/release-31/common/bcp47/collation.xml

> Given that we cannot reasonably preload all these new variants that you
> demonstrated, I think it would make sense to drop all the keyword
> variants from the preloaded set.

Cool. While I am of course in favor of this, I actually understand
very well why you had initdb add them. I think that removing them
creates a discoverability problem that cannot easily be fixed through
documentation. ISTM that we ought to also add an SQL-callable function
that lists the most common keyword variants. Some of those are
specific to one or two locales, such as traditional Spanish, or the
alternative sort orders for Han characters.

What do you think of that idea?

I guess an alternative idea is to just link to that XML document
(collation.xml), which exactly specifies the variants. Users can get
the "co" variants there. It should be for the most part obvious which
one is interesting to which locale, since there are not that many "co"
variants to choose from, and users will probably know what to look for
if they look at all.
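
To give a feel for how the "co" variants could be pulled out of such a
document, here is a small Python sketch. The embedded XML is a
hand-written, heavily abbreviated sample, not a copy of collation.xml:
the key/type element layout is an assumption about that file's
structure, and the descriptions are paraphrased. list_types is a
hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hand-written sample in the assumed shape of collation.xml.
SAMPLE = """<ldmlBCP47>
  <keyword>
    <key name="co" description="Collation type key">
      <type name="phonebk" description="Phonebook style ordering (German)"/>
      <type name="pinyin" description="Pinyin ordering for Han characters"/>
      <type name="stroke" description="Stroke order for Han characters"/>
      <type name="trad" description="Traditional ordering (e.g. Spanish)"/>
    </key>
    <key name="kn" description="Numeric ordering">
      <type name="true" description="Sort digit runs by numeric value"/>
      <type name="false" description="Sort digits like other characters"/>
    </key>
  </keyword>
</ldmlBCP47>"""


def list_types(xml_text, key_name):
    """Return (name, description) pairs for every type under the given key."""
    root = ET.fromstring(xml_text)
    found = []
    for key in root.iter("key"):
        if key.get("name") != key_name:
            continue
        for type_elem in key.findall("type"):
            found.append((type_elem.get("name"), type_elem.get("description", "")))
    return found


for name, description in list_types(SAMPLE, "co"):
    print("{}: {}".format(name, description))
```

Something along these lines, run against the real file for a given
CLDR release, would give users a quick listing of the variants
without reading the XML by hand.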

--
Peter Geoghegan
