Re: case insensitive collation of Greek's sigma

From:	Jakub Jedelsky <jakub(dot)jedelsky(at)gooddata(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, pgsql-general(at)lists(dot)postgresql(dot)org, Jan Chochol <jan(dot)chochol(at)gooddata(dot)com>
Subject:	Re: case insensitive collation of Greek's sigma
Date:	2021-12-02 13:26:39
Message-ID:	CAC1JxDQi+z47rdv1szaxyrhAL8-wheZgTggjdj5AQAL4F=xR7w@mail.gmail.com
Lists:	pgsql-general

On Wed, Dec 1, 2021 at 8:49 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com> writes:
> > Running lower() like this is really the wrong thing to do. We should be
> > doing "case folding" instead, which normalizes these differences for the
> > purpose of case-insensitive comparisons.
>
> That just begs the question: if tolower (or towlower) isn't the
> appropriate API, what is? Perhaps ICU has something for a more
> generalized notion of case-similarity, but I'm not aware of any such
> thing in the POSIX API.
>
> BTW, I think it's only accidental that the regex example shown upthread
> gets the right answer. In that example, what's happening is that we
> consider a letter in a case-insensitive regex to match itself, or
> tolower() of itself, or toupper() of itself. Both σ and ς have Σ
> as toupper() so they both work. But if you'd written Σ in the regex,
> only one of σ and ς would match that as a data character. (Haven't
> actually tested this, but given the way the code works I'm pretty
> sure it's so.) Again, it's hard to see how to do better atop a POSIX
> locale library.
>

Thanks for digging into the issue.
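
To make the regex rule described above concrete, here is a minimal Python sketch. It is only a model of the rule as Tom states it (a pattern letter matches itself, its lowercase, or its uppercase), not PostgreSQL's regex code. The helper name ci_char_matches is hypothetical, and Python's single-character str.lower()/str.upper() stand in for the C library's towlower()/towupper(), which is an assumption.

```python
def ci_char_matches(pattern_char: str, data_char: str) -> bool:
    """Model of the described rule: a pattern letter matches itself,
    its lowercase form, or its uppercase form (hypothetical helper)."""
    candidates = {pattern_char, pattern_char.lower(), pattern_char.upper()}
    return data_char in candidates


if __name__ == "__main__":
    sigmas = "σςΣ"  # small sigma, final small sigma, capital sigma
    for pattern_char in sigmas:
        matched = [d for d in sigmas if ci_char_matches(pattern_char, d)]
        print(f"pattern {pattern_char!r} matches {matched}")
```

Under this model a pattern containing σ or ς still matches Σ in the data, but a pattern containing Σ matches σ and not ς, which is the asymmetry Tom expects.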

According to the GNU libunistring documentation [1], the POSIX APIs are not
up to that. Still, would it be possible to keep the current lower()-based
behaviour as a fallback for POSIX collations and use proper case folding
with ICU? I'm not an expert, but I believe ICU has had working case-folding
code for some time.
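
As a rough illustration of why case folding behaves differently from lowercasing here, a small Python sketch follows. It relies on Python's built-in str.casefold(), which implements Unicode case folding. It is not ICU or PostgreSQL code, and the two helper names are hypothetical. In ICU the corresponding entry point would, as far as I know, be u_strFoldCase().

```python
def equal_by_lower(a: str, b: str) -> bool:
    """Compare after lowercasing, roughly what a lower()-based
    case-insensitive comparison does (hypothetical helper)."""
    return a.lower() == b.lower()


def equal_by_casefold(a: str, b: str) -> bool:
    """Compare after Unicode case folding (hypothetical helper)."""
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    # The same letters, spelled once with final sigma and once with medial sigma.
    pairs = [("ς", "σ"), ("οδος", "οδοσ")]
    for a, b in pairs:
        print(a, b,
              "lower:", equal_by_lower(a, b),
              "casefold:", equal_by_casefold(a, b))
```

Lowercasing leaves ς and σ distinct, so the lower()-based comparison reports a mismatch, while case folding maps both to σ and the strings compare equal.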

[1] https://www.gnu.org/software/libunistring/
"""
Text files are nowadays usually encoded in Unicode, and may consist of very
different scripts – from Latin letters to Chinese Hanzi –, with many kinds
of special characters – accents, right-to-left writing marks, hyphens,
Roman numbers, and much more. But the POSIX platform APIs for text do not
contain adequate functions for dealing with particular properties of many
Unicode characters. In fact, the POSIX APIs for text have several
assumptions at their base which don't hold for Unicode text.
"""

In response to

Re: case insensitive collation of Greek's sigma at 2021-12-01 19:49:24 from Tom Lane

Responses

Re: case insensitive collation of Greek's sigma at 2021-12-02 14:04:04 from Gianni Ceccarelli
