strcmp() tie-breaker for identical ICU-collated strings

From: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: strcmp() tie-breaker for identical ICU-collated strings
Date: 2017-06-01 18:58:51
Message-ID: CAJ3gD9ez9O7scYT46iRXU-1KDfDeSQvJ2Ekzxs7RYGofYfB4cg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

While comparing two text strings using varstr_cmp(), if *strcoll*()
call returns 0, we do strcmp() tie-breaker to do binary comparison,
because strcoll() can return 0 for non-identical strings :

varstr_cmp()
{
...
/*
* In some locales strcoll() can claim that nonidentical strings are
* equal. Believing that would be bad news for a number of reasons,
* so we follow Perl's lead and sort "equal" strings according to
* strcmp().
*/
if (result == 0)
result = strcmp(a1p, a2p);
...
}

But is this supposed to apply for ICU collations as well ? If
collation provider is icu, the comparison is done using
ucol_strcoll*(). I suspect that ucol_strcoll*() intentionally returns
some characters as being identical, so doing strcmp() may not make
sense.

For e.g. , if the below two characters are compared using
ucol_strcollUTF8(), it returns 0, meaning the strings are identical :
Greek Oxia : UTF-16 encoding : 0x1FFD
(http://www.fileformat.info/info/unicode/char/1ffd/index.htm)
Greek Tonos : UTF-16 encoding : 0x0384
(http://www.fileformat.info/info/unicode/char/0384/index.htm)

The characters are displayed like this :
postgres=# select (U&'\+001FFD') , (U&'\+000384') collate ucatest;
?column? | ?column?
----------+----------
´ | ΄
(Although this example has similar looking characters, this might not
be a factor behind treating them equal)

Now since ucol_strcoll*() returns 0, these strings are always compared
using strcmp(), so 1FFD > 0384 returns true :

create collation ucatest (locale = 'en_US.UTF8', provider = 'icu');

postgres=# select (U&'\+001FFD') > (U&'\+000384') collate ucatest;
?column?
----------
t

Whereas, if strcmp() is skipped for ICU collations :
if (result == 0 && !(mylocale && mylocale->provider == COLLPROVIDER_ICU))
result = strcmp(a1p, a2p);

... then the comparison using ICU collation tells they are identical strings :

postgres=# select (U&'\+001FFD') > (U&'\+000384') collate ucatest;
?column?
----------
f
(1 row)

postgres=# select (U&'\+001FFD') < (U&'\+000384') collate ucatest;
?column?
----------
f
(1 row)

postgres=# select (U&'\+001FFD') <= (U&'\+000384') collate ucatest;
?column?
----------
t

Now I have verified that strcoll() returns true for 1FFD > 0384. So,
it looks like ICU API function ucol_strcoll() returns false by
intention. That's the reason I feel like the
strcmp-if-strtoll-returns-0 thing might not be applicable for ICU. But
I may be wrong, please correct me if I may be missing something.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeevan Ladhe 2017-06-01 19:35:03 Re: Adding support for Default partition in partitioning
Previous Message Andres Freund 2017-06-01 18:28:46 Re: logical replication busy-waiting on a lock