Re: encoding affects ICU regex character classification

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: encoding affects ICU regex character classification
Date: 2023-12-09 21:39:37
Message-ID: CA+hUKGKqS6MntnF33Uzspx9=Ac5ronKnvqJ_hqL2Hx41xLiuKQ@mail.gmail.com
Lists: pgsql-hackers

On Sat, Dec 2, 2023 at 9:49 AM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> Your definition is too wide in my opinion, because it mixes together
> different sources of variation that are best left separate:
> a. region/language
> b. technical requirements
> c. versioning
> d. implementation variance
>
> (a) is not a true source of variation (please correct me if I'm wrong)
>
> (b) is perhaps interesting. The "C" locale is one example, and perhaps
> there are others, but I doubt very many others that we want to support.
>
> (c) is not a major concern in my opinion. The impact of Unicode changes
> is usually not dramatic, and it only affects regexes so it's much more
> contained than collation, for example. And if you really care, just use
> the "C" locale.
>
> (d) is mostly a bug

I get you. I was mainly commenting on what POSIX APIs allow, which is
much wider than what you might observe on <your local libc>, and also
end-user-customisable. But I agree that Unicode is all-pervasive and
authoritative in practice, to the point that if your libc disagrees
with it, it's probably just wrong. (I guess site-local locales were
essential for bootstrapping in the early days of computers in a
language/territory but I can't find much discussion of the tools being
used by non-libc-maintainers today.)

> I think we only need 2 main character classification schemes: "C" and
> Unicode (TR #18 Compatibility Properties[1], either the "Standard"
> variant or the "POSIX Compatible" variant or both). The libc and ICU
> ones should be there only for compatibility and discouraged and
> hopefully eventually removed.

How would you specify what you want? As with collating, I like the
idea of keeping support for libc even if it is terrible (some libcs
more than others) and eventually not the default, because I think
optional agreement with other software on the same host is a feature.

In the regex code we see not only class membership tests, e.g.
iswlower_l(), but also conversions, e.g. towlower_l(). Unless you also
implement built-in case mapping, you'd still have to call libc or ICU
for that, right? It seems a bit strange to use different systems for
classification and mapping. If you do implement mapping too, you have
to decide if you believe it is language-dependent or not, I think?

Hmm, let's see what we're doing now... for ICU the regex code is using
"simple" case mapping functions like u_toupper(c) that don't take a
locale, so no Turkish i/İ conversion for you, unlike our SQL
upper()/lower(), which this is supposed to agree with according to the
comments at the top. I see why: POSIX can only do one-by-one
character mappings (which cannot handle Greek's context-sensitive
Σ->σ/ς or German's multi-character ß->SS), while ICU offers only
language-aware "full" string conversion (which does not guarantee
1:1 mapping for each character in a string) OR non-language-aware
"simple" character conversion (which does not handle Turkish's i->İ).
ICU has no middle ground for language-aware mapping with just the 1:1
cases only, probably because that doesn't really make total sense as a
concept (as I assume Greek speakers would agree).
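
To make the distinction above concrete, here is a small sketch in Python (not PostgreSQL or ICU code). It relies on Python's built-in str.upper() and str.lower(), which apply Unicode's locale-independent full string mappings. The per_char_upper() and per_char_lower() helpers are hypothetical names introduced only for illustration: they approximate a 1:1, POSIX-style interface by mapping each character in isolation and keeping only single-character results.

```python
def per_char_upper(s):
    """Hypothetical helper: uppercase one character at a time, 1:1 only."""
    out = []
    for ch in s:
        mapped = ch.upper()
        # A 1:1 interface cannot return two characters, so keep the original.
        out.append(mapped if len(mapped) == 1 else ch)
    return "".join(out)


def per_char_lower(s):
    """Hypothetical helper: lowercase one character at a time, 1:1 only."""
    out = []
    for ch in s:
        mapped = ch.lower()
        # Each character is mapped without seeing its neighbours.
        out.append(mapped if len(mapped) == 1 else ch)
    return "".join(out)


greek = "\u039f\u0394\u039f\u03a3"   # ΟΔΟΣ, ends in capital sigma
german = "stra\u00dfe"               # straße

# Whole-string mapping can look at context and change the length.
full_greek = greek.lower()
full_german = german.upper()

# Per-character mapping can do neither.
isolated_greek = per_char_lower(greek)
isolated_german = per_char_upper(german)

# No locale is involved, so there is no Turkish dotted capital I.
plain_i = "i".upper()

print(full_greek, isolated_greek)
print(full_german, isolated_german)
print(plain_i)
```

The whole-string call produces a final ς and expands ß to SS, while the per-character helpers leave a non-final σ and an unchanged ß. Neither path turns i into İ, because no language is supplied.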

> > > Not knowing anything about how glibc generates its charmaps,
> > > Unicode
> > > or pre-Unicode, I could take a wild guess that maybe in LATIN9 they
> > > have an old hand-crafted table, but for UTF-8 encoding it's fully
> > > outsourced to Unicode, and that's why you see a difference.
>
> No, the problem is that we're passing a pg_wchar to an ICU function
> that expects a 32-bit code point. Those two things are equivalent in
> the UTF8 encoding, but not in the LATIN9 encoding.

Ah right, I get that now (sorry, I confused myself by forgetting we
were talking about ICU).
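
The mismatch Jeff describes can be illustrated outside PostgreSQL with a short Python sketch. It assumes that in a single-byte encoding such as LATIN9 the wide-character value is simply the byte value, which is what the quoted explanation implies. The function names are hypothetical, and Python's unicodedata module stands in for ICU's classification functions.

```python
import unicodedata


def category_if_treated_as_code_point(byte_value):
    """Classify the raw byte value as if it were a Unicode code point."""
    return unicodedata.category(chr(byte_value))


def category_after_decoding(byte_value, encoding="iso8859_15"):
    """Convert the byte to its real code point first, then classify it."""
    return unicodedata.category(bytes([byte_value]).decode(encoding))


# 0x41 is 'A' in both views; 0xA6 and 0xBD are positions LATIN9 redefines.
samples = [0x41, 0xA6, 0xBD]
results = {
    b: (category_if_treated_as_code_point(b), category_after_decoding(b))
    for b in samples
}

for b, (raw, decoded) in results.items():
    print(f"0x{b:02X}: raw={raw} decoded={decoded}")
```

Byte 0xA6 is Š (an uppercase letter) in LATIN9 but U+00A6 BROKEN BAR (a symbol) when read as a code point, and 0xBD is œ in LATIN9 but U+00BD (a fraction) as a code point. Bytes that LATIN9 shares with the first 256 code points, such as 0x41, classify the same either way, which is why the problem only shows up for a handful of characters.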
