Re: ICU integration

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ICU integration
Date: 2016-09-23 06:27:22
Message-ID: CAEepm=2QFfhs9OqSnFmDOKucCooj-5EBmPnF+nyGGxfsUkUnkg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Aug 31, 2016 at 2:46 PM, Peter Eisentraut
<peter(dot)eisentraut(at)2ndquadrant(dot)com> wrote:
> Here is a patch I've been working on to allow the use of ICU for sorting
> and other locale things.

This is very interesting work, and it's great to see some development
in this area. I've been peripherally involved in more than one
collation-change-broke-my-data incident over the years. I took the
patch for a quick spin today. Here are a couple of initial
observations.

This patch adds pkg-config support to our configure script, in order
to produce the build options for ICU. That's cool, and I'm a fan of
pkg-config, but it's an extra dependency that I just wanted to
highlight. For example MacOSX appears to ship with ICU but has is no
pkg-config or ICU .pc files out of the box (erm, I can't even find the
headers, so maybe that copy of ICU is useless to everyone except
Apple). The MacPorts ICU port ships with .pc files, so that's easy to
deal with, and I don't expect it to be difficult to get a working
pkg-config and meta-data installed alongside ICU on any platform that
doesn't already ship with them. It may well be useful for configuring
other packages. (There is also other cool stuff that would become
possible once ICU is optionally around, off topic.)

There is something missing from the configure script however: the
output of pkg-config is not making it into CFLAGS or LDFLAGS, so
building fails on FreeBSD and MacOSX where for example
<unicode/ucnv.h> doesn't live in the default search path. I tried
very briefly to work out what but autoconfuse defeated me.

You call the built-in strcoll/strxfrm collation provider "posix", and
although POSIX does define those functions, it only does so because it
inherits them from ISO C90. As far as I can tell Windows provides
those functions because it conforms to the C spec, not the POSIX spec,
though I suppose you could argue that in that respect it DOES conform
to the POSIX spec... Also, we already have a collation called
"POSIX". Maybe we should avoid confusing people with a "posix"
provider and a "POSIX" collation? But then I'm not sure what else to
call it, but perhaps "system" as Petr Jelinek suggested[1], or "libc".

postgres=# select * from pg_collation where collname like 'en_NZ%';
┌──────────────────┬───────────────┬───────────┬──────────────┬──────────────┬──────────────────┬──────────────────┬─────────────┐
│ collname │ collnamespace │ collowner │ collprovider │
collencoding │ collcollate │ collctype │ collversion │
├──────────────────┼───────────────┼───────────┼──────────────┼──────────────┼──────────────────┼──────────────────┼─────────────┤
│ en_NZ │ 11 │ 10 │ p │
6 │ en_NZ │ en_NZ │ 0 │
│ en_NZ │ 11 │ 10 │ p │
8 │ en_NZ.ISO8859-1 │ en_NZ.ISO8859-1 │ 0 │
│ en_NZ │ 11 │ 10 │ p │
16 │ en_NZ.ISO8859-15 │ en_NZ.ISO8859-15 │ 0 │
│ en_NZ.ISO8859-1 │ 11 │ 10 │ p │
8 │ en_NZ.ISO8859-1 │ en_NZ.ISO8859-1 │ 0 │
│ en_NZ.ISO8859-15 │ 11 │ 10 │ p │
16 │ en_NZ.ISO8859-15 │ en_NZ.ISO8859-15 │ 0 │
│ en_NZ.UTF-8 │ 11 │ 10 │ p │
6 │ en_NZ.UTF-8 │ en_NZ.UTF-8 │ 0 │
│ en_NZ%icu │ 11 │ 10 │ i │
-1 │ en_NZ │ en_NZ │ -1724383232 │
└──────────────────┴───────────────┴───────────┴──────────────┴──────────────┴──────────────────┴──────────────────┴─────────────┘
(7 rows)

I notice that you use encoding -1, meaning "all". I haven't fully
grokked what that really means but it appears that we won't be able to
create any new such collations after initdb using CREATE COLLATION, if
for example you update your ICU installation and now have a new
collation available. When I tried dropping some and recreating them
they finished up with collencoding = 6. Should the initdb-created
rows use 6 too?

+ ucol_getVersion(collator, versioninfo);
+ numversion = ntohl(*((uint32 *) versioninfo));
+
+ if (numversion != collform->collversion)
+ ereport(WARNING,
+ (errmsg("ICU collator version mismatch"),
+ errdetail("The database was created using version 0x%08X, the
library provides version 0x%08X.",
+ (uint32) collform->collversion, (uint32) numversion),
+ errhint("Rebuild affected indexes, or build PostgreSQL with the
right version of ICU.")));

I played around with bumping version numbers artificially. That gives
each session that accesses the collation one warning:

postgres=# select * from foo order by id;
WARNING: ICU collator version mismatch
DETAIL: The database was created using version 0x99380001, the
library provides version 0x99380000.
HINT: Rebuild affected indexes, or build PostgreSQL with the right
version of ICU.
┌────┐
│ id │
├────┤
└────┘
(0 rows)

postgres=# select * from foo order by id;
┌────┐
│ id │
├────┤
└────┘
(0 rows)

The warning persists even after I reindex all affected tables (and
start a new backend), because you're only recording the collation at
pg_collation level and then only setting it at initdb time, so that
HINT isn't much use for clearing the warning. I think you'd need to
record a snapshot of the version used for each collation that was used
on *each index*, and update it whenever you CREATE INDEX or REINDEX
etc. There can of course be more than one collation and thus version
involved, if you have columns with different collations, so I guess
you'd need an column to hold an array of version snapshots whose order
corresponds to pg_index.indcollation's.

I contemplated something like that for the built-in libc collations,
based on extensions or external scripts that would know how to
checksum the collation definition files for your particular operating
system[2] and track them per index (and per column) in a user table.
That's not material for a core feature but it did cause me to think
about this problem a little bit. That was based on the idea that
you'd run a script periodically or at least after OS upgrades, and
either have it run index automatically or email you a set of command
to run off hours.

Obviously you can't refuse to come up on mismatch, or you'll never be
able to REINDEX. Automatically launched unscheduled REINDEXes in core
would be antisocial and never fly. But a warning could easily go
unnoticed leading to corruption (not only reads, but also UNIQUE
CONSTRAINTs not enforced, yada yada yada). I wonder what else we
could reasonably do...

A couple of thoughts about abbreviated keys:

#ifndef TRUST_STRXFRM
if (!collate_c)
abbreviate = false;
#endif

I think this macro should affect only strxfrm, and we should trust
ucol_getSortKey or disable it independently. ICU's manual says
reassuring things like "Sort keys are most useful in databases" and
"Sort keys are generally only useful in databases or other
circumstances where function calls are extremely expensive".

It looks like varstr_abbrev_convert calls strxfrm unconditionally
(assuming TRUST_STRXFRM is defined). <captain-obvious>This needs to
use ucol_getSortKey instead when appropriate.</> It looks like it's a
bit more helpful than strxfrm about telling you the output buffer size
it wants, and it doesn't need nul termination, which is nice.
Unfortunately it is like strxfrm in that the output buffer's contents
is unspecified if it ran out of space.

+++ b/src/test/regress/expected/collate.icu.out
@@ -0,0 +1,1076 @@
+/*
+ * This test is for Linux/glibc systems and assumes that a full set of
+ * locales is installed. It must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */

Should say ICU.

[1] https://www.postgresql.org/message-id/3113bd83-9b3a-a95b-cf24-4b5b1dc6666f%402ndquadrant.com
[2] https://github.com/macdice/check_pg_collations

--
Thomas Munro
http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2016-09-23 07:42:54 Re: PL/Python adding support for multi-dimensional arrays
Previous Message Pavel Stehule 2016-09-23 06:21:21 Re: Showing parallel status in \df+