Re: ICU integration

From: Andreas Karlsson <andreas(at)proxel(dot)se>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ICU integration
Date: 2017-03-19 22:31:57
Message-ID: 29b8c74d-4029-b7f4-1dae-6e622e9ce70b@proxel.se
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 03/15/2017 05:33 PM, Peter Eisentraut wrote:
> Updated patch attached.

Ok, I applied to patch again and ran the tests. I get a test failure on
make check-world in the pg_dump tests but it can be fixed with the below.

diff --git a/src/bin/pg_dump/t/002_pg_dump.pl
b/src/bin/pg_dump/t/002_pg_dump.pl
index 3cac4a9ae0..7d9c90363b 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -2422,7 +2422,7 @@ qr/^\QINSERT INTO test_fifth_table (col1, col2,
col3, col4, col5) VALUES (NULL,
'CREATE COLLATION test0 FROM "C";',
regexp =>
qr/^
- \QCREATE COLLATION test0 (lc_collate = 'C', lc_ctype = 'C');\E/xm,
+ \QCREATE COLLATION test0 (provider = 'libc', locale = 'C');\E/xm,
like => {
binary_upgrade => 1,
clean => 1,

>> - I do not like how it is not obvious which is the default version of
>> every locale. E.g. I believe that "sv%icu" is "sv_reformed%icu" and not
>> "sv_standard%icu" as one might expect. Is this inherent to ICU or
>> something we can work around?
>
> We get these keywords from ucol_getKeywordValuesForLocale(), which says
> "Given a key and a locale, returns an array of string values in a
> preferred order that would make a difference." So all those are
> supposedly different from each other.

I believe you are mistaken. The locale "sv" is just an alias for
"sv-u-standard" as far as I can tell. See the definition of the Swedish
locale
(http://source.icu-project.org/repos/icu/trunk/icu4c/source/data/coll/sv.txt)
and there are just three collations: reformed (default), standard,
search (plus eot and emoji which are inherited).

I am also quite annoyed at col-emoji and col-eor (European Ordering
Rules). They are defined at the root and inherited by all languages, but
no language modifies col-emoji for their needs which makes it a foot
gun. See the Danish sorting example below where at least I expected the
same order. For col-eor it makes a bit more sense since I would expect
the locales en_GB-u-col-eot and sv_SE-u-col-eot to collate exactly the same.

# SELECT * FROM (VALUES ('a'), ('k'), ('aa')) q (x) ORDER BY x COLLATE
"da-x-icu";
x
----
a
k
aa
(3 rows)

# SELECT * FROM (VALUES ('a'), ('k'), ('aa')) q (x) ORDER BY x COLLATE
"da-u-co-emoji-x-icu";
x
----
a
aa
k
(3 rows)

It seems ICU has made quite the mess here, and I am not sure to what
degree we need to handle it to avoid our users getting confused. I need
to think some of it, and would love input from others. Maybe the right
thing to do is to ignore the issue with col-emoji, but I would want to
do something about the default collations.

>> - ICU talks about a new syntax for locale extensions (old:
>> de_DE(at)collation=phonebook, new: de_DE_u_co_phonebk) at this page
>> http://userguide.icu-project.org/collation/api. Is this something we
>> need to car about? It looks like we currently use the old format, and
>> while I personally prefer it I am not sure we should rely on an old syntax.
>
> Interesting. I hadn't see this before, and the documentation is sparse.
> But it seems that this refers to BCP 47 language tags, which seem like
> a good idea.
>
> So what I have done is change all the predefined ICU collations to be
> named after the BCP 47 scheme, with a "private use" -x-icu suffix
> (instead of %icu). The preserves the original idea but uses a standard
> naming scheme.
>
> I'm not terribly worried that they are going to remove the "old" locale
> naming, but just to be forward looking, I have changed it so that the
> collcollate entries are made using the "new" naming for ICU >=54.

Sounds good.

>> - I get an error when creating a new locale.
>>
>> #CREATE COLLATION sv2 (LOCALE = 'sv');
>> ERROR: could not create locale "sv": Success
>>
>> # CREATE COLLATION sv2 (LOCALE = 'sv');
>> ERROR: could not create locale "sv": Resource temporarily unavailable
>> Time: 1.109 ms
>
> Hmm, that's pretty straightforward code. What is your operating system?
> What are the build options? Does it work without this patch?

This issue is unrelated to ICU. I had forgot to specify provider so the
eorrs are correct (even though that the value of the errno is weird).

>> - For the collprovider is it really correct to say that 'd' is the
>> default value as it does in catalogs.sgml?
>
> It doesn't say it's the default value, it says it uses the database
> default. This is all a bit confusing. We have a collation object named
> "default", which uses the locale set for the database. That's been that
> way for a while. Now when introducing the collation providers, that
> "default" collation gets its own collprovider category 'd'. That is not
> really used anywhere, but we have to put something there.

Ah, I see now. Hm, that is a bit awkward but I cannot think of a cleaner
alternative.

>> - I do not know what the policy for formatting the documentation is, but
>> some of the paragraphs are in need of re-wrapping.
>
> Hmm, I don't see anything terribly bad.

Maybe it is just me who is sensitive to wrapping, but her is an example
which sticks out to me with its 98 character line.

+ <para>
+ A collation object provided by <literal>libc</literal> maps to a
combination
+ of <symbol>LC_COLLATE</symbol> and <symbol>LC_CTYPE</symbol>
settings. (As
the name would suggest, the main purpose of a collation is to set
<symbol>LC_COLLATE</symbol>, which controls the sort order. But
it is rarely necessary in practice to have an
<symbol>LC_CTYPE</symbol> setting that is different from
<symbol>LC_COLLATE</symbol>, so it is more convenient to collect
these under one concept than to create another infrastructure for
- setting <symbol>LC_CTYPE</symbol> per expression.) Also, a collation
+ setting <symbol>LC_CTYPE</symbol> per expression.) Also, a
<literal>libc</literal> collation
is tied to a character set encoding (see <xref linkend="multibyte">).
The same collation name may exist for different encodings.
</para>

>> - Add a hint to "ERROR: conflicting or redundant options". The error
>> message is pretty unclear.
>
> I don't see that in my patch. Example?

It is for the below. But after some thinking I am fine not fixing it.

# CREATE COLLATION svi (LOCALE = 'sv', LC_COLLATE = 'sv', PROVIDER = icu);
ERROR: conflicting or redundant options

>> - I am not a fan of this patch comparing the encoding with a -1 literal.
>> How about adding -1 as a value to the enum? See the example below for a
>> place where the patch compares with -1.
>
> That's existing practice. Not a great practice, probably, but widespread.

Ok, if it is widespread in the code then let's keep using it. In a case
like this consistency is just as important.

>> - The patch adds "FIXME" in the below. Is that a left-over from
>> development or something which still needs doing?
>>
>> /*
>> * Also forbid matching an any-encoding entry. This test of course
>> is not
>> * backed up by the unique index, but it's not a problem since we don't
>> - * support adding any-encoding entries after initdb.
>> + * support adding any-encoding entries after initdb. FIXME
>> */
>
> I had mentioned that upthread. It technically needs "doing" as you say,
> but it's not clear how and it's not terribly important, arguably.

The comment is no longer true since for ICU we can do that (it is not an
issue though). At the very least this comment needs to be updated.

Andreas

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2017-03-20 00:32:06 Re: [COMMITTERS] pgsql: Add TAP tests for password-based authentication methods.
Previous Message Tom Lane 2017-03-19 21:40:22 Re: [COMMITTERS] pgsql: Fix and simplify check for whether we're running as Windows serv