Re: Inconsistent results with libc sorting on Windows

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Juan José Santamaría Flecha <juanjo(dot)santamaria(at)gmail(dot)com>, Daniel Verite <daniel(at)manitou-mail(dot)org>, Joe Conway <mail(at)joeconway(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Inconsistent results with libc sorting on Windows
Date: 2023-06-14 02:13:17
Message-ID: CAH2-WzmtLiQn_4Q5W4GKR8QUiW2MqLEYQtL5EwNDbbPtsSNvsA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jun 13, 2023 at 5:59 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> Trying to follow along here... you're doing the moral equivalent of
> strxfrm(), so sort keys have the transitive property but direct string
> comparisons don't? Or is this because LCIDs reach a different
> algorithm somehow (or otherwise why do you need to use LCIDs for this,
> when there is a non-LCID version of that function, with a warning not
> to use the older LCID version[1]?)

I'm reminded of the fact that the abbreviated keys strxfrm() debacle
(back when 9.5 was released) was caused by a bug in strcoll() -- not a
bug in strxfrm() itself. From our point of view the problem was that
strxfrm() failed to be bug compatible with strcoll() due to a buggy
strcoll() optimization.

I believe that strxfrm() is generally less likely to have bugs than
strcoll(). There are far fewer opportunities to dodge unnecessary work
in the case of strxfrm()-like algorithms (offering something like
ICU's pg_strnxfrm_prefix_icu() prefix optimization is the only one).
On the other hand, collation library implementers are likely to
heavily optimize strcoll() for typical use-cases such as sorting and
binary search. Using strxfrm() for everything is discouraged [1].

There is an important difference between this issue and the various
glibc collation related bugs that I've come across, though: to the
best of my knowledge there was never a glibc bug that caused strcoll()
to violate transitive consistency -- it always agreed with itself. So
this is a new one on me. Seems like that might make "always use
strxfrm()" (or whatever it's actually called on this platform)
acceptable; strxfrm() can't really violate transitive consistency in
the same way. (I think -- I'm assuming that it'll always produce a
conditioned binary string in a deterministic fashion, since AFAICT
even this buggy strcoll()-like function won't ever give an
inconsistent answer when it compares the same two strings.)

[1] https://unicode-org.github.io/icu/userguide/collation/concepts#sortkeys-vs-comparison
--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tatsuo Ishii 2023-06-14 02:16:19 Re: make_ctags: use -I option to ignore pg_node_attr macro
Previous Message Amit Kapila 2023-06-14 02:11:01 Re: Non-superuser subscription owners