Re: insensitive collations

From: "Daniel Verite" <daniel(at)manitou-mail(dot)org>
To: "Andreas Karlsson" <andreas(at)proxel(dot)se>
Cc: "Peter Eisentraut" <peter(dot)eisentraut(at)2ndquadrant(dot)com>,"pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: insensitive collations
Date: 2019-01-14 16:21:34
Message-ID: ef84c67b-cfa9-4a3f-b0ae-e9ff81e9d948@manitou-mail.org
Lists: pgsql-hackers

Andreas Karlsson wrote:

> > Nondeterministic collations do address this by allowing canonically
> > equivalent code point sequences to compare as equal. You still need a
> > collation implementation that actually does compare them as equal; ICU
> > does this, glibc does not AFAICT.
>
> Ah, right! You could use -ks-identic[1] for this.

Strings that differ like that are considered equal even at this level:

postgres=# create collation identic (locale='und-u-ks-identic',
provider='icu', deterministic=false);
CREATE COLLATION

postgres=# select 'é' = E'e\u0301' collate "identic";
?column?
----------
t
(1 row)
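
For readers without an ICU-enabled build at hand, the two literals in the query above can be inspected outside the database. This is only a sketch using Python's standard unicodedata module; the helper name canonically_equivalent is made up for illustration and is not part of PostgreSQL or ICU. It shows that the strings are different code point sequences that become identical once both are normalized to NFD, which is what "canonically equivalent" means here.

```python
import unicodedata

# The two literals from the query: one precomposed code point versus
# a base letter followed by U+0301 COMBINING ACUTE ACCENT.
precomposed = "\u00e9"
decomposed = "e\u0301"


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: true when both strings share the same NFD form."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# A plain code point comparison sees two different strings ...
print(precomposed == decomposed)
# ... while comparison after normalization treats them as the same text.
print(canonically_equivalent(precomposed, decomposed))
```

A comparison on raw code points behaves like the first print, whereas the nondeterministic ICU collation above behaves like the second.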

There's also a separate setting, "colNormalization" ("kk" as a BCP 47 key).

From
http://www.unicode.org/reports/tr35/tr35-collation.html#Normalization_Setting

"The UCA always normalizes input strings into NFD form before the
rest of the algorithm. However, this results in poor performance.
With normalization=off, strings that are in [FCD] and do not contain
Tibetan precomposed vowels (U+0F73, U+0F75, U+0F81) should sort
correctly. With normalization=on, an implementation that does not
normalize to NFD must at least perform an incremental FCD check and
normalize substrings as necessary"

But even setting this to false does not mean that NFD and NFC forms
of the same text compare as different:

postgres=# create collation identickk (locale='und-u-ks-identic-kk-false',
provider='icu', deterministic=false);
CREATE COLLATION

postgres=# select 'é' = E'e\u0301' collate "identickk";
?column?
----------
t
(1 row)

AFAIU such strings may only compare as different when at least one of
them is not in FCD form (http://unicode.org/notes/tn5/#FCD).

There are also ICU-specific explanations about FCD here:
http://source.icu-project.org/repos/icu/icuhtml/trunk/design/collation/ICU_collation_design.htm#Normalization
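
To make the FCD condition more concrete, here is a rough sketch of the check as I read it from UTN #5: decompose each character canonically and verify that the combining class at the end of one decomposition is not greater than the non-zero combining class at the start of the next. It uses Python's unicodedata module, is_fcd is a hypothetical name, and this is not how ICU implements the check internally.

```python
import unicodedata


def is_fcd(s: str) -> bool:
    """Sketch of an FCD check: per-character canonical decompositions,
    concatenated without reordering, must already be in canonical order."""
    prev_trail = 0  # combining class of the last code point seen so far
    for ch in s:
        decomp = unicodedata.normalize("NFD", ch)
        lead = unicodedata.combining(decomp[0])
        trail = unicodedata.combining(decomp[-1])
        # A starter (class 0) resets the ordering; otherwise classes
        # must not decrease across the character boundary.
        if lead != 0 and prev_trail > lead:
            return False
        prev_trail = trail
    return True


print(is_fcd("\u00e9"))        # precomposed é
print(is_fcd("e\u0301"))       # e + combining acute
print(is_fcd("\u00e1\u0323"))  # á followed by combining dot below
```

Both forms of "é" from the examples pass this check, which fits the results above. The third string does not pass: the precomposed "á" ends in an acute accent (combining class 230) and is followed by a dot below (class 220), so the concatenated decompositions are out of canonical order.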

It looks like setting colNormalization to false might provide a
performance benefit when you know your contents are in FCD
form, which is mostly the case according to ICU:

"Note that all NFD strings are in FCD, and in practice most NFC
strings will also be in FCD; for that matter most strings (of whatever
ilk) will be in FCD.
We guarantee that if any input strings are in FCD, that we will get
the right results in collation without having to normalize".

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
