What users can do with custom ICU collations in Postgres 10

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: What users can do with custom ICU collations in Postgres 10
Date: 2017-08-09 22:49:41
Message-ID: CAH2-Wz=bcgmk97YaZ3rs4OoCdn1nOco1HCfRGBCOOty0ztnCnA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

There are actually very many customizations to collations that are
possible beyond what the "stock" ICU collations provide (whatever
"stock" means). Some of these are really cool, and I can imagine use
cases where they are very compelling that have nothing to do with
internationalization (such customizations are how we should eventually
implement case-insensitive collations, once the infrastructure for
doing that without breaking hashing is in place).

I'd like to give a demo on what is already possible, but not currently
documented. I didn't see anyone else comment on this, including Peter
E (maybe I missed that?). We should improve the documentation in this
area, to get this into the hands of users.

Say we're unhappy that numbers come first, which we see here:

postgres=# select * from (select '1a' i union select '1b' union select
'1c' union select 'a1' union select 'b2' union select 'c3') j order by
i collate "en-x-icu";
i
────
1a
1b
1c
a1
b2
c3
(6 rows)

We may do this to get our desired sort order:

postgres=# create collation digitlast (provider=icu,
locale='en-u-kr-latn-digit');
CREATE COLLATION
postgres=# select * from (select '1a' i union select '1b' union select
'1c' union select 'a1' union select 'b2' union select 'c3') j order by
i collate "digitlast";
i
────
a1
b2
c3
1a
1b
1c
(6 rows)

Note that 'kr' is a specific BCP47 Key [1]. Many different options can
be set in this manner.

Let's say we are unhappy with the fact that capital letters sort
higher than lowercase:

postgres=# select * from (select 'B' i union select 'b' union select
'A' union select 'a') j order by i collate "en-x-icu";
i
───
a
A
b
B
(4 rows)

ICU provides a solution here, too:

postgres=# create collation capitalfirst (provider=icu, locale='en-u-kf-upper');
CREATE COLLATION
postgres=#
select * from (select 'B' i union select 'b' union select 'A' union
select 'a') j order by i collate "capitalfirst";
i
───
A
a
B
b
(4 rows)

And, yes -- you can even *combine* these two options by creating a
third custom collation. That can be spelled
'en-u-kf-upper-kr-latn-digit', in case you were wondering.

Users have certainly complained about not liking this or that aspect
of how glibc sorts text many times over the years, particularly around
things like how whitespace and punctuation are handled, which they can
now do something about [2]. Users can also have numbers sort like
numbers should when compared against other numbers, by using the
numericOrdering option (not shown). numericOrdering would be great for
things like alphanumeric invoice numbers, or the alphanumeric car
registration plate numbers that are used in certain countries [3],
with fixed number/letter fields. These options are very powerful.

[1] http://unicode.org/reports/tr35/tr35-collation.html#Setting_Options
[2] http://unicode.org/reports/tr35/tr35-collation.html#Common_Settings
[3] https://en.wikipedia.org/wiki/Vehicle_registration_plates_of_the_Republic_of_Ireland
--
Peter Geoghegan

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2017-08-09 23:39:20 Re: Log LDAP "diagnostic messages"?
Previous Message Peter Geoghegan 2017-08-09 21:29:09 Re: Crash report for some ICU-52 (debian8) COLLATE and work_mem values