Re: ICU integration

From: Andreas Karlsson <andreas(at)proxel(dot)se>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ICU integration
Date: 2017-03-15 02:15:12
Message-ID: 47e95371-2f86-d5b7-8dd3-88944cd14b94@proxel.se
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 03/09/2017 10:13 PM, Peter Eisentraut wrote:
>> - Naming of collations: Are we happy with the "de%icu" naming? I might
>> have come up with that while reviewing the IPv6 zone index patch. ;-)
>> An alternative might be "de$icu" for more Oracle vibe and avoiding the
>> need for double quotes in some cases. (But we have mixed-case names
>> like "de_AT%icu", so further changes would be necessary to fully get rid
>> of the need for quoting.) A more radical alternative would be to
>> install ICU locales in a different schema and use the search_path, or
>> even have a separate search path setting for collations only. Which
>> leads to ...

I do not like the schema based solution since search_path already gives
us enough headaches. As for the naming I am fine with the current scheme.

Would be nice with something we did not have to quote, but I do not like
using dollar signs since they are already use for other things.

>> - Selecting default collation provider: Maybe we want a setting, say in
>> initdb, to determine which provider's collations get the "good" names?
>> Maybe not necessary for this release, but something to think about.

This does not seem necessary for the initial release.

>> - Currently (in this patch), we check a collation's version when it is
>> first used. But, say, after pg_upgrade, you might want to check all of
>> them right away. What might be a good interface for that? (Possibly,
>> we only have to check the ones actually in use, and we have dependency
>> information for that.)

How about adding a SQL level function for checking this which can be
called by pg_upgrade?

= Review

Here is an initial review. I will try to find some time to do more
testing later this week.

This is a really useful feature given the poor quality of collation
support libc. Just that ICU versions the encodings is huge, and the
larger range of available collations with high configurability.
Additionally being able to use abbreviated keys again would be huge.

ICU also supports writing your own collations and modifying existing
ones, something which might be possible to expose in the future. In
general ICU offers a lot for users who care about the details of text
collation.

== Functional

- Generally things seem to work fine and as expected.

- I get a test failures in the default test suite due to not having the
tr_TR locale installed. I would assume that this would be pretty common
for hackers.

***************
*** 428,443 ****

-- to_char
SET lc_time TO 'tr_TR';
SELECT to_char(date '2010-04-01', 'DD TMMON YYYY');
to_char
-------------
! 01 NIS 2010
(1 row)

SELECT to_char(date '2010-04-01', 'DD TMMON YYYY' COLLATE "tr%icu");
to_char
-------------
! 01 NİS 2010
(1 row)

-- backwards parsing
--- 428,444 ----

-- to_char
SET lc_time TO 'tr_TR';
+ ERROR: invalid value for parameter "lc_time": "tr_TR"
SELECT to_char(date '2010-04-01', 'DD TMMON YYYY');
to_char
-------------
! 01 APR 2010
(1 row)

SELECT to_char(date '2010-04-01', 'DD TMMON YYYY' COLLATE "tr%icu");
to_char
-------------
! 01 APR 2010
(1 row)

-- backwards parsing

======================================================================

- The code no longer compiles without HAVE_LOCALE_T.

- I do not like how it is not obvious which is the default version of
every locale. E.g. I believe that "sv%icu" is "sv_reformed%icu" and not
"sv_standard%icu" as one might expect. Is this inherent to ICU or
something we can work around?

- ICU talks about a new syntax for locale extensions (old:
de_DE(at)collation=phonebook, new: de_DE_u_co_phonebk) at this page
http://userguide.icu-project.org/collation/api. Is this something we
need to car about? It looks like we currently use the old format, and
while I personally prefer it I am not sure we should rely on an old syntax.

- I get an error when creating a new locale.

#CREATE COLLATION sv2 (LOCALE = 'sv');
ERROR: could not create locale "sv": Success

# CREATE COLLATION sv2 (LOCALE = 'sv');
ERROR: could not create locale "sv": Resource temporarily unavailable
Time: 1.109 ms

== Code review

- For the collprovider is it really correct to say that 'd' is the
default value as it does in catalogs.sgml?

- I do not know what the policy for formatting the documentation is, but
some of the paragraphs are in need of re-wrapping.

- Add a hint to "ERROR: conflicting or redundant options". The error
message is pretty unclear.

- I am not a fan of this patch comparing the encoding with a -1 literal.
How about adding -1 as a value to the enum? See the example below for a
place where the patch compares with -1.

ereport(NOTICE,
(errcode(ERRCODE_DUPLICATE_OBJECT),
- errmsg("collation \"%s\" for encoding \"%s\" already
exists, skipping",
- collname, pg_encoding_to_char(collencoding))));
+ collencoding == -1
+ ? errmsg("collation \"%s\" already exists, skipping",
+ collname)
+ : errmsg("collation \"%s\" for encoding \"%s\" already
exists, skipping",
+ collname, pg_encoding_to_char(collencoding))));
return InvalidOid;

- The patch adds "FIXME" in the below. Is that a left-over from
development or something which still needs doing?

/*
* Also forbid matching an any-encoding entry. This test of course
is not
* backed up by the unique index, but it's not a problem since we don't
- * support adding any-encoding entries after initdb.
+ * support adding any-encoding entries after initdb. FIXME
*/

- Should functions like normalize_locale_name() be renamed to indicate
they relate to libc locales? I am leaning towards doing so but have not
looked closely at the task.

Andreas

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2017-03-15 02:22:15 Remove obsolete text from hash/README
Previous Message Massimo Fidanza 2017-03-15 02:08:22 New procedural language