Re: ICU integration

From: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To: Andreas Karlsson <andreas(at)proxel(dot)se>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ICU integration
Date: 2017-03-15 16:33:23
Message-ID: 5291804b-169e-3ba9-fdaf-fae8e7d2d959@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Updated patch attached.

On 3/14/17 22:15, Andreas Karlsson wrote:
> I do not like the schema based solution since search_path already gives
> us enough headaches. As for the naming I am fine with the current scheme.

Yeah, it seems we're going to settle on the suffix idea. See below for
an idea that builds on that.

> - I get a test failures in the default test suite due to not having the
> tr_TR locale installed. I would assume that this would be pretty common
> for hackers.

I have removed that test. It seems it's not possible to test that
portably without major contortions.

> - The code no longer compiles without HAVE_LOCALE_T.

Fixed that.

> - I do not like how it is not obvious which is the default version of
> every locale. E.g. I believe that "sv%icu" is "sv_reformed%icu" and not
> "sv_standard%icu" as one might expect. Is this inherent to ICU or
> something we can work around?

We get these keywords from ucol_getKeywordValuesForLocale(), which says
"Given a key and a locale, returns an array of string values in a
preferred order that would make a difference." So all those are
supposedly different from each other.

> - ICU talks about a new syntax for locale extensions (old:
> de_DE(at)collation=phonebook, new: de_DE_u_co_phonebk) at this page
> http://userguide.icu-project.org/collation/api. Is this something we
> need to car about? It looks like we currently use the old format, and
> while I personally prefer it I am not sure we should rely on an old syntax.

Interesting. I hadn't see this before, and the documentation is sparse.
But it seems that this refers to BCP 47 language tags, which seem like
a good idea.

So what I have done is change all the predefined ICU collations to be
named after the BCP 47 scheme, with a "private use" -x-icu suffix
(instead of %icu). The preserves the original idea but uses a standard
naming scheme.

I'm not terribly worried that they are going to remove the "old" locale
naming, but just to be forward looking, I have changed it so that the
collcollate entries are made using the "new" naming for ICU >=54.

> - I get an error when creating a new locale.
>
> #CREATE COLLATION sv2 (LOCALE = 'sv');
> ERROR: could not create locale "sv": Success
>
> # CREATE COLLATION sv2 (LOCALE = 'sv');
> ERROR: could not create locale "sv": Resource temporarily unavailable
> Time: 1.109 ms

Hmm, that's pretty straightforward code. What is your operating system?
What are the build options? Does it work without this patch?

> - For the collprovider is it really correct to say that 'd' is the
> default value as it does in catalogs.sgml?

It doesn't say it's the default value, it says it uses the database
default. This is all a bit confusing. We have a collation object named
"default", which uses the locale set for the database. That's been that
way for a while. Now when introducing the collation providers, that
"default" collation gets its own collprovider category 'd'. That is not
really used anywhere, but we have to put something there.

> - I do not know what the policy for formatting the documentation is, but
> some of the paragraphs are in need of re-wrapping.

Hmm, I don't see anything terribly bad.

> - Add a hint to "ERROR: conflicting or redundant options". The error
> message is pretty unclear.

I don't see that in my patch. Example?

> - I am not a fan of this patch comparing the encoding with a -1 literal.
> How about adding -1 as a value to the enum? See the example below for a
> place where the patch compares with -1.

That's existing practice. Not a great practice, probably, but widespread.

> - The patch adds "FIXME" in the below. Is that a left-over from
> development or something which still needs doing?
>
> /*
> * Also forbid matching an any-encoding entry. This test of course
> is not
> * backed up by the unique index, but it's not a problem since we don't
> - * support adding any-encoding entries after initdb.
> + * support adding any-encoding entries after initdb. FIXME
> */

I had mentioned that upthread. It technically needs "doing" as you say,
but it's not clear how and it's not terribly important, arguably.

> - Should functions like normalize_locale_name() be renamed to indicate
> they relate to libc locales? I am leaning towards doing so but have not
> looked closely at the task.

Renamed.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
v6-0001-ICU-support.patch application/x-patch 156.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Mengxing Liu 2017-03-15 16:35:34 Re: Re: [GSOC 17] Eliminate O(N^2) scaling from rw-conflict tracking in serializable transactions
Previous Message Emre Hasegeli 2017-03-15 16:32:58 Re: Parallel Bitmap scans a bit broken