Re: How can I find a specific collation in pg_collation when using ICU?

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: "Tsunakawa, Takayuki" <tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, "maumau307(at)gmail(dot)com" <maumau307(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: How can I find a specific collation in pg_collation when using ICU?
Date: 2017-08-10 01:43:45
Message-ID: CAH2-Wz=dRsf_qMpHJ5A7pndmpo1sP1MtzrsUpuuNaCCsecCopg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Aug 9, 2017 at 6:19 PM, Tsunakawa, Takayuki
<tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com> wrote:
> From: pgsql-hackers-owner(at)postgresql(dot)org
>> [mailto:pgsql-hackers-owner(at)postgresql(dot)org] On Behalf Of Peter Eisentraut
>> There are no case-insensitive collations in PostgreSQL (yet).
>
> That's sad news, as I expected ICU to bring its various collation features to PostgreSQL. I hope it will be easy to add them.

The reason it is not easy is that text equality is based on strict
binary equality. We would have to teach hash operator classes about
collations to fix this, and make text_eq() hash strxfrm() or
something. That requires special work. You can ask ICU for case
insensitivity with Postgres 10, but the strcmp() tie breaker within
varstr_cmp() will prevent it from being truly case insensitive.

>> The best explanation of the difference that I can understand is here, under
>> "Why do CJK strings sort incorrectly in Unicode?":
>>
>> https://dev.mysql.com/doc/refman/5.5/en/faqs-cjk.html
>
> Thanks a lot. MysQL seems to have many collations, doesn't it?

Well, it depends on how you define collation, which actually gets
quite complicated with ICU. You can combine certain options together
with great flexibility, it seems (see my e-mail from earlier today --
"What users can do with custom ICU collations in Postgres 10"). Let's
assume that there are 10 distinct options for each locale that each
independently affect sort order for the base collation of the locale.
I believe that there are at least 10 such options that change things,
and maybe a lot more. Theoretically, this means that there are an
absolutely enormous number of possible collations with ICU.

Now, I wouldn't *actually* say that we have many thousands of
collations with ICU, because that doesn't seem like a sensible way to
explain ICU Postgres collations. The way that we think about this from
using libc doesn't work well for ICU. Instead, I would say that we
have hundreds of locales (not collations), each of which support some
variant options (e.g., traditional Spanish sort order, alternative
Japanese sort order, pictographic emoji sorting), with some further
generic options for varying how case it handled, how numbers are
handled, and other things like that.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2017-08-10 01:59:39 Re: How can I find a specific collation in pg_collation when using ICU?
Previous Message Thomas Munro 2017-08-10 01:41:05 Re: Log LDAP "diagnostic messages"?