Re: case insensitive collation of Greek's sigma

From: Jakub Jedelsky <jakub(dot)jedelsky(at)gooddata(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, pgsql-general(at)lists(dot)postgresql(dot)org, Jan Chochol <jan(dot)chochol(at)gooddata(dot)com>
Subject: Re: case insensitive collation of Greek's sigma
Date: 2021-12-02 13:26:39
Message-ID: CAC1JxDQi+z47rdv1szaxyrhAL8-wheZgTggjdj5AQAL4F=xR7w@mail.gmail.com
Lists: pgsql-general

On Wed, Dec 1, 2021 at 8:49 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com> writes:
> > Running lower() like this is really the wrong thing to do. We should be
> > doing "case folding" instead, which normalizes these differences for the
> > purpose of case-insensitive comparisons.
>
> That just begs the question: if tolower (or towlower) isn't the
> appropriate API, what is? Perhaps ICU has something for a more
> generalized notion of case-similarity, but I'm not aware of any such
> thing in the POSIX API.
>
> BTW, I think it's only accidental that the regex example shown upthread
> gets the right answer. In that example, what's happening is that we
> consider a letter in a case-insensitive regex to match itself, or
> tolower() of itself, or toupper() of itself. Both σ and ς have Σ
> as toupper() so they both work. But if you'd written Σ in the regex,
> only one of σ and ς would match that as a data character. (Haven't
> actually tested this, but given the way the code works I'm pretty
> sure it's so.) Again, it's hard to see how to do better atop a POSIX
> locale library.
>
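
To make the matching rule described above concrete, here is a minimal sketch in Python. It is only a model of "a pattern letter matches itself, its tolower(), or its toupper()"; it is not PostgreSQL's regex code. It also relies on Python's built-in Unicode tables rather than a POSIX locale's tolower()/toupper(), so a given platform's libc may behave differently. The helper names ci_candidates and ci_match are made up for illustration.

```python
def ci_candidates(pattern_char):
    """Set of data characters a pattern letter accepts under the
    'itself, lower(itself), upper(itself)' rule (hypothetical helper)."""
    return {pattern_char, pattern_char.lower(), pattern_char.upper()}


def ci_match(pattern_char, data_char):
    """True if data_char would match pattern_char under that rule."""
    return data_char in ci_candidates(pattern_char)


if __name__ == "__main__":
    sigmas = ["σ", "ς", "Σ"]  # small sigma, final sigma, capital sigma
    for p in sigmas:
        for d in sigmas:
            print(f"pattern {p!r} vs data {d!r}: {ci_match(p, d)}")
```

In this model, a capital Σ in the pattern accepts σ but not ς, because lowercasing a standalone Σ yields only σ. That is the asymmetry Tom describes.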

Thanks for digging into the issue.

According to the GNU libunistring documentation [1], the POSIX APIs are not
up to this. Would it be possible to keep the current lower()-based behaviour
as a fallback for POSIX locales and use proper case folding with ICU? I'm no
expert, but I believe ICU has had working case-folding code for some time.
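
As a rough illustration of the difference between lowercasing and case folding, here is a small Python sketch. It uses Python's str.lower() and str.casefold() as stand-ins; this is an assumption for demonstration only and says nothing about how PostgreSQL or ICU would implement it. The function names equal_by_lower and equal_by_casefold are hypothetical.

```python
def equal_by_lower(a, b):
    """Case-insensitive comparison done by lowercasing both sides."""
    return a.lower() == b.lower()


def equal_by_casefold(a, b):
    """Case-insensitive comparison done by Unicode case folding."""
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    pairs = [("σ", "ς"), ("Σ", "ς"), ("ΟΔΟΣ", "οδος"), ("οδος", "οδοσ")]
    for a, b in pairs:
        print(f"{a!r} vs {b!r}: lower={equal_by_lower(a, b)}, "
              f"casefold={equal_by_casefold(a, b)}")
```

Lowercasing leaves σ and ς as distinct characters, so a comparison built on it treats them as different. Case folding maps both (and Σ) to σ, which is the normalization Peter refers to.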

[1] https://www.gnu.org/software/libunistring/
"""
Text files are nowadays usually encoded in Unicode, and may consist of very
different scripts – from Latin letters to Chinese Hanzi –, with many kinds
of special characters – accents, right-to-left writing marks, hyphens,
Roman numbers, and much more. But the POSIX platform APIs for text do not
contain adequate functions for dealing with particular properties of many
Unicode characters. In fact, the POSIX APIs for text have several
assumptions at their base which don't hold for Unicode text.
"""
