Re: strcmp() tie-breaker for identical ICU-collated strings

From: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: strcmp() tie-breaker for identical ICU-collated strings
Date: 2017-06-09 06:27:16
Message-ID: CAJ3gD9dYK42pEFWhLfeGny5fnY9PysR-ywZ=MWKtiewP=tPK-Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2 June 2017 at 23:52, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> On Fri, Jun 2, 2017 at 10:34 AM, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote:
>> Ok. I was thinking we are doing the tie-breaker because specifically
>> strcoll_l() was unexpectedly returning 0 for some cases. Now I get it,
>> that we do that to be compatible with texteq().
>
> Both of these explanations are correct, in a way. See commit 656beff.
>
>> Secondly, I was also considering if ICU especially has a way to
>> customize an ICU locale by setting some attributes which dictate
>> comparison or sorting rules for a set of characters. I mean, if there
>> is such customized ICU locale defined in the system, and we use that
>> to create PG collation, I thought we might have to strictly follow
>> those rules without a tie-breaker, so as to be 100% conformant to ICU.
>> I can't come up with an example, or may there isn't one, but , say ,
>> there is a locale which is supposed to sort only by lowest comparison
>> strength (de(at)strength=1 ?? ). In that case, there might be many
>> characters considered equal, but PG < operator or > operator would
>> still return true for those chars.
>
> In the terminology of the Unicode collation algorithm, PostgreSQL
> "forces deterministic comparisons" [1]. There is a lot of information
> on the details of that within the UCA spec.
>
> If we ever wanted to offer a case insensitive collation feature, then
> we wouldn't necessarily have to do the equivalent of a full strxfrm()
> when hashing, at least with collations controlled by ICU. Perhaps we
> could instead use a collator whose UCOL_STRENGTH is only UCOL_PRIMARY
> to build binary sort keys, and leave the rest to a ucol_equal() call
> (within texteq()) that has the usual UCOL_STRENGTH for the underlying
> PostgreSQL collation.
>
> I don't think it would be possible to implement case insensitive
> collations by using some pre-existing ICU collation that is case
> insensitive. Instead, an implementation might directly vary collation
> strength of any given collation to achieve case insensitivity.
> PostgreSQL would know that this collation was case insensitive, so
> regular collations wouldn't need to change their
> behavior/implementation (to use ucol_equal() within texteq(), and so
> on).

Ah ok. Understood, thanks.

Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2017-06-09 07:05:48 Re: How to refer to resource files from UDFs written in C
Previous Message Michael Paquier 2017-06-09 06:23:28 Re: BUG #14680: startup process on standby encounter a deadlock of TwoPhaseStateLock when redo 2PC xlog