Re: Refactor to introduce pg_strcoll().

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Refactor to introduce pg_strcoll().
Date: 2023-03-05 22:20:48
Message-ID: CA+hUKG+BgA7nXBW22hZR2c1c=kBazZiojxotnYe1PjFMj1ELMw@mail.gmail.com
Lists: pgsql-hackers

+ /* Win32 does not have UTF-8, so we need to map to UTF-16 */

I wonder if this is still true. I think in Windows 10+ you can enable
UTF-8 support. Then could you use strcoll_l() directly? I struggled
to understand that, but I am a simple Unix hobbit from the shire so I
dunno. (Perhaps the *whole OS* has to be in that mode, so you might
have to do a runtime test? This was discussed in another thread that
mostly left me confused[1].)

And that leads to another thought. We have an old comment
"Unfortunately, there is no strncoll(), so ...". Curiously, Windows
does actually have strncoll_l(), spelled _strncoll_l() there (as do
some other libcs out there). So after skipping the expansion to
wchar_t, one might think you could avoid the extra copy required to
nul-terminate the string (and hope that it doesn't make an extra copy
internally, which is far from a given).
Unfortunately it seems to be defined in a strange way that doesn't
look like your pg_strncoll_XXX() convention: it has just one length
parameter, not one for each string. That is, it's designed for
comparing prefixes of strings, not for working with strings that
aren't nul-terminated. I'm not entirely sure if the interface
makes sense at all! Is it measuring in 'chars' or 'encoded
characters'? I would guess the former, like strncpy() et al, but then
what does it mean if it chops a UTF-8 sequence in half? And at a
higher level, if you wanted to use it for our purpose, you'd
presumably need Min(s1_len, s2_len), but I wonder if there are string
pairs that would sort in a different order if the collation algorithm
could see more characters after that? For example, in Dutch "ij" is
sometimes treated like a letter that sorts differently than "i" + "j"
normally would, so if you arbitrarily chop that "j" off while
comparing common-length prefix you might get into trouble; likewise
for "aa" in Danish. Perhaps these sorts of problems explain why it's
not in the standard (though I see it was at some point in some kind of
draft; I don't grok the C standards process enough to track down what
happened but WG20/WG14 draft N1027[2] clearly contains strncoll_l()
alongside the stuff that we know and use today). Or maybe I'm
underthinking it.
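
To make the two concerns above concrete, here is a minimal sketch in
plain C. It is not PostgreSQL code: strncoll_by_copy() and
cut_splits_utf8_sequence() are hypothetical names invented for this
illustration. The first shows the "extra copy" approach with one length
per string (the shape of the pg_strncoll_XXX() convention), using only
standard strcoll(). The second shows how a length counted in chars can
land in the middle of a UTF-8 sequence, assuming the input is valid
UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: compare two byte strings that are not
 * nul-terminated, each with its own length, by copying both into one
 * scratch buffer, nul-terminating them, and calling strcoll().
 * Assumes the inputs contain no embedded nul bytes.
 */
static int
strncoll_by_copy(const char *s1, size_t len1, const char *s2, size_t len2)
{
    char   *buf;
    int     result;

    assert(s1 != NULL && s2 != NULL);

    /* One allocation holds both copies plus their terminators. */
    buf = malloc(len1 + len2 + 2);
    if (buf == NULL)
        abort();

    memcpy(buf, s1, len1);
    buf[len1] = '\0';
    memcpy(buf + len1 + 1, s2, len2);
    buf[len1 + len2 + 1] = '\0';

    /* strcoll() uses the current LC_COLLATE locale. */
    result = strcoll(buf, buf + len1 + 1);

    free(buf);
    return result;
}

/*
 * Hypothetical helper: returns 1 if truncating s (len bytes of valid
 * UTF-8) to its first n bytes would cut a multi-byte sequence in half,
 * that is, if the byte at offset n is a continuation byte (10xxxxxx).
 */
static int
cut_splits_utf8_sequence(const char *s, size_t len, size_t n)
{
    if (n == 0 || n >= len)
        return 0;
    return ((unsigned char) s[n] & 0xC0) == 0x80;
}
```

A single-length interface would have to be called with something like
Min(s1_len, s2_len), and cut_splits_utf8_sequence() only detects the
encoding-level hazard of doing that; it says nothing about
collation-level units such as Dutch "ij", which need the collation
algorithm itself to see the following characters.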

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKGJ%3DXThErgAQRoqfCy1bKPxXVuF0%3D2zDbB%2BSxDs59pv7Fw%40mail.gmail.com
[2] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1027.pdf
