Re: strcmp() tie-breaker for identical ICU-collated strings

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: strcmp() tie-breaker for identical ICU-collated strings
Date: 2017-06-01 21:24:53
Message-ID: CAEepm=3nmZj6AAFn7CjCwHw_59nrP+2c58ryn5fhS4C9PWggMQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jun 2, 2017 at 6:58 AM, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote:
> While comparing two text strings using varstr_cmp(), if *strcoll*()
> call returns 0, we do strcmp() tie-breaker to do binary comparison,
> because strcoll() can return 0 for non-identical strings :
>
> varstr_cmp()
> {
> ...
> /*
> * In some locales strcoll() can claim that nonidentical strings are
> * equal. Believing that would be bad news for a number of reasons,
> * so we follow Perl's lead and sort "equal" strings according to
> * strcmp().
> */
> if (result == 0)
> result = strcmp(a1p, a2p);
> ...
> }
>
> But is this supposed to apply for ICU collations as well ? If
> collation provider is icu, the comparison is done using
> ucol_strcoll*(). I suspect that ucol_strcoll*() intentionally returns
> some characters as being identical, so doing strcmp() may not make
> sense.
>
> For example, if the two characters below are compared using
> ucol_strcollUTF8(), it returns 0, meaning the strings are identical:
> Greek Oxia : UTF-16 encoding : 0x1FFD
> (http://www.fileformat.info/info/unicode/char/1ffd/index.htm)
> Greek Tonos : UTF-16 encoding : 0x0384
> (http://www.fileformat.info/info/unicode/char/0384/index.htm)
>
> The characters are displayed like this :
> postgres=# select (U&'\+001FFD') , (U&'\+000384') collate ucatest;
> ?column? | ?column?
> ----------+----------
> ´ | ΄
> (Although this example has similar looking characters, this might not
> be a factor behind treating them equal)
>
> Now since ucol_strcoll*() returns 0, these strings are always compared
> using strcmp(), so 1FFD > 0384 returns true :
>
> create collation ucatest (locale = 'en_US.UTF8', provider = 'icu');
>
> postgres=# select (U&'\+001FFD') > (U&'\+000384') collate ucatest;
> ?column?
> ----------
> t
>
> Whereas, if strcmp() is skipped for ICU collations :
> if (result == 0 && !(mylocale && mylocale->provider == COLLPROVIDER_ICU))
> result = strcmp(a1p, a2p);
>
> ... then the comparison using ICU collation tells they are identical strings :
>
> postgres=# select (U&'\+001FFD') > (U&'\+000384') collate ucatest;
> ?column?
> ----------
> f
> (1 row)
>
> postgres=# select (U&'\+001FFD') < (U&'\+000384') collate ucatest;
> ?column?
> ----------
> f
> (1 row)
>
> postgres=# select (U&'\+001FFD') <= (U&'\+000384') collate ucatest;
> ?column?
> ----------
> t
>
>
> Now I have verified that strcoll() returns true for 1FFD > 0384. So,
> it looks like ICU API function ucol_strcoll() returns false by
> intention. That's the reason I feel like the
> strcmp-if-strcoll-returns-0 thing might not be applicable for ICU. But
> I may be wrong; please correct me if I am missing something.

I may not have had enough coffee yet, but...

Why should ICU be any different than the system provider in this
respect? In both cases, we have a two-level comparison: first we use
the collation-aware comparison, and then as a tie breaker, we use a
binary comparison. If we didn't do a binary comparison as a
tie-breaker, wouldn't the result be logically incompatible with the =
operator, which does a binary comparison?

Put another way, if we didn't use binary order tie-breaking, we'd have
to teach texteq to understand collations (i.e., be defined as not (a < b)
and not (a > b)); otherwise we'd permit contradictions like a != b and
not (a < b) and not (a > b).
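
To make the argument concrete, here is a minimal sketch in C. It is not
the actual varstr_cmp() code: toy_collate() is a hypothetical stand-in for
strcoll()/ucol_strcoll() that simply ignores ASCII case, and the other
function names are also made up for illustration. It only shows that with
the strcmp() tie-breaker, "compares as 0" coincides with binary equality.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/*
 * Hypothetical stand-in for a collation-aware comparison: compares ASCII
 * strings ignoring case, so "abc" and "ABC" collate as equal even though
 * their bytes differ.
 */
int toy_collate(const char *a, const char *b)
{
    while (*a && *b)
    {
        int ca = tolower((unsigned char) *a);
        int cb = tolower((unsigned char) *b);

        if (ca != cb)
            return (ca < cb) ? -1 : 1;
        a++;
        b++;
    }
    if (*a)
        return 1;               /* a is longer */
    if (*b)
        return -1;              /* b is longer */
    return 0;
}

/* Two-level comparison: collation first, binary order as the tie-breaker. */
int cmp_with_tiebreak(const char *a, const char *b)
{
    int result = toy_collate(a, b);

    if (result == 0)
        result = strcmp(a, b);
    return result;
}

/* Binary equality, standing in for what the = operator does. */
int binary_eq(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}

/* The invariant: the ordering says "equal" exactly when = does. */
void check_consistent(const char *a, const char *b)
{
    assert((cmp_with_tiebreak(a, b) == 0) == binary_eq(a, b));
}
```

With toy_collate() alone, "abc" and "ABC" are neither less than nor
greater than each other, yet binary_eq() says they differ, which is the
contradiction described above. check_consistent("abc", "ABC") passes
because the tie-breaker gives them a definite order.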

--
Thomas Munro
http://www.enterprisedb.com
