ICU integration

From: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: ICU integration
Date: 2016-08-31 02:46:31
Message-ID: 85364fde-091f-bbc0-fec2-e3ede39840a6@2ndquadrant.com
Lists: pgsql-hackers

Here is a patch I've been working on to allow the use of ICU for sorting
and other locale things.

This is mostly complementary to the existing FreeBSD ICU patch, most
recently discussed in [0]. While that patch removes the POSIX locale
use and replaces it with ICU, my interest was in allowing the use of
both. I think that is necessary for upgrading, compatibility, and maybe
because someone likes it.

What I have done is extend collation objects with a collprovider column
that tells whether the collation is using POSIX (appropriate name?) or
ICU facilities. The pg_locale_t type is changed to a struct that
contains the provider-specific locale handles. Users of locale
information are changed to look into that struct for the appropriate
handle to use.
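
To make the idea of a provider-tagged locale struct concrete, here is a minimal sketch in Python rather than the C of the patch. All names (`CollProvider`, `PgLocale`, `sort_key`) and the toy handles are assumptions for illustration only; they are not taken from the patch, and the lambdas merely stand in for a real locale_t or ICU collator.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class CollProvider(Enum):
    # Hypothetical values for the collprovider column.
    POSIX = "posix"
    ICU = "icu"


@dataclass
class PgLocale:
    """Stand-in for the pg_locale_t struct: a provider tag plus per-provider handles."""
    provider: CollProvider
    posix_handle: Optional[Callable[[str], object]] = None  # stands in for a locale_t
    icu_handle: Optional[Callable[[str], object]] = None    # stands in for an ICU collator


def sort_key(loc: PgLocale, s: str):
    """Look into the struct and use the handle that matches the provider."""
    if loc.provider is CollProvider.ICU:
        return loc.icu_handle(s)
    return loc.posix_handle(s)


# Toy handles, not real locale behavior: plain code point order versus
# a case-insensitive primary key with the original string as tiebreaker.
posix_c = PgLocale(CollProvider.POSIX, posix_handle=lambda s: s)
icu_like = PgLocale(CollProvider.ICU, icu_handle=lambda s: (s.casefold(), s))

words = ["b", "B", "a"]
posix_sorted = sorted(words, key=lambda s: sort_key(posix_c, s))
icu_sorted = sorted(words, key=lambda s: sort_key(icu_like, s))
```

The point of the sketch is only that callers never assume a provider; they branch on the tag and pick the matching handle.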

In initdb, I initialize the default collation set as before from the
`locale -a` output, but also add all available ICU locales with "%icu"
appended (so "fr_FR%icu"). I suppose one could add an option, perhaps
in initdb, to change the default so that, say, "fr_FR" uses ICU and
"fr_FR%posix" uses the old stuff.
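
The naming scheme can be sketched as follows. This is an illustration in Python, not code from the patch; the function name `default_collation_names` and the `icu_default` switch are hypothetical, the latter modeling the possible option to flip the default.

```python
def default_collation_names(posix_locales, icu_locales, icu_default=False):
    """Return (collation name, provider) pairs for the initial collation set.

    By default, POSIX locales keep their bare name and ICU locales get a
    "%icu" suffix. With icu_default=True (a hypothetical option), ICU gets
    the bare name and POSIX locales get a "%posix" suffix instead.
    """
    entries = []
    for name in posix_locales:
        entries.append((name + "%posix" if icu_default else name, "posix"))
    for name in icu_locales:
        entries.append((name if icu_default else name + "%icu", "icu"))
    return entries


current_scheme = default_collation_names(["fr_FR"], ["fr_FR"])
flipped_scheme = default_collation_names(["fr_FR"], ["fr_FR"], icu_default=True)
```

Either way, the same locale name can exist once per provider without a name clash.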

That all works well enough for named collations and for sorting. The
thread about the FreeBSD ICU patch discusses some details of how to best
use the ICU APIs to do various aspects of the sorting, so I didn't focus
on that too much. I took the existing collate.linux.utf8.sql test and
ported it to the ICU setup, and it passes except for the case noted below.

I'm not sure how well it will work to replace all the bits of LIKE and
regular expressions with ICU API calls. One problem is that ICU likes
to do case folding as a whole string, not by character. I need to do
more research about that. Another problem, which was also previously
discussed, is that ICU does case folding in a locale-agnostic manner, so
it does not consider things such as the Turkish special cases. This is
per the Unicode standard, modulo weasel wording, but it breaks existing
tests at least.
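
Both problems can be seen without ICU. The sketch below uses Python's built-in string methods, which apply the Unicode default (locale-independent) case mappings; it is only a stand-in for ICU's behavior, not a call into ICU.

```python
# Whole-string mapping: the result can be longer than the input, so a
# character-by-character API cannot express it.
word = "Straße"
upper_whole = word.upper()

# Locale-agnostic mapping: a Turkish locale would expect dotless and dotted
# forms here, but the default mappings know nothing about the locale.
DOTLESS_SMALL_I = "\u0131"   # what Turkish expects for lower("I")
DOTTED_CAPITAL_I = "\u0130"  # what Turkish expects for upper("i")

lower_of_capital_i = "I".lower()
upper_of_small_i = "i".upper()
```

Here `upper_whole` is "STRASSE", one character longer than the input, and neither `lower_of_capital_i` nor `upper_of_small_i` yields the Turkish form.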

So right now the entries in collcollate and collctype need to be valid
for ICU *and* POSIX for everything to work.

Also note that ICU locales are encoding-independent and don't support a
separate collcollate and collctype, so the existing catalog structure is
not optimal.

Where it gets really interesting is what to do with the database
locales. They just set the global process locale. So in order to port
that to ICU we'd need to check every implicit use of the process locale
and tweak it. We could add a datcollprovider column or something. But
we also rely on the datctype setting to validate the encoding of the
database. Maybe we wouldn't need that anymore, but it sounds risky.

We could have a datcollation column that by OID references a collation
defined inside the database. With a background worker, we can log into
the database as it is being created and make adjustments, including
defining or adjusting collation definitions. This would open up
interesting new possibilities.

What is a way to go forward here? What's a minimal useful feature that
is future-proof? Just allow named collations referencing ICU for now?
Is throwing out POSIX locales even for the process locale reasonable?

Oh, that case folding code in formatting.c needs some refactoring.
There are so many ifdefs there, and the code is repeated almost
identically three times; it's crazy to work in that.

[0]:
https://www.postgresql.org/message-id/flat/789A2F56-0E42-409D-A840-6AF5110D6085%40pingpong.net

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
icu-integration.patch text/x-patch 113.4 KB
