Re: C11: should we use char32_t for unicode code points?

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Tatsuo Ishii <ishii(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: C11: should we use char32_t for unicode code points?
Date: 2025-10-28 17:59:26
Message-ID: 8e5e4892c5c6b1b031b8e715dd254f21d9fb2bd9.camel@j-davis.com
Lists: pgsql-hackers

On Tue, 2025-10-28 at 15:40 +1300, Thomas Munro wrote:
> I was noticing that toupper_libc_mb() directly tests if a pg_wchar
> value is in the ASCII range, which only makes sense given knowledge
> of
pg_wchar's encoding, so perhaps that should trigger this new coding
> rule.  But I agree that's pretty obscure...  feel free to ignore that
> suggestion.

I'm not sure that casting it to char32_t would be an improvement there.
Perhaps if we can find some ways to generally clarify things (some of
which you suggest below), that could be part of a follow-up.

It looks like the current patch is a step in the right direction, so
I'll commit that soon and see what the buildfarm says.

> Hmm, the comment at the top explains that we apply that special ASCII
> treatment for default locales and not non-default locales, but it
> doesn't explain *why* we make that distinction.  Do you know?

It makes some sense: I suppose someone thought that non-ASCII behavior
in the default locale is just too likely to cause problems. But the
non-ASCII behavior is allowed if you use a COLLATE clause.

But the pattern wasn't followed quite the same way with ICU, which uses
the given locale for UPPER()/LOWER() regardless of whether it's the
default locale or not. And for regexes, ICU doesn't use the locale at
all, it just uses u_isalpha(), etc., even if you use a COLLATE clause.

And there are still some places that call plain tolower()/toupper(),
such as fuzzystrmatch and ltree.

>
> Right, we do know the encoding of pg_wchar in every case (assuming
> that all pg_wchar values come from our transcoding routines).  We
> just
> don't know if that encoding is also the one used by libc's
> locale-sensitive functions that deal in wchar_t, except when the
> locale is one that uses UTF-8 for char encoding, in which case we
> assume that every libc must surely use Unicode codepoints in wchar_t.

Ah, right. We create pg_wchars for any encoding, but we only pass a
pg_wchar to a libc multibyte function in the UTF-8 encoding.
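
To make that distinction concrete, here's a minimal sketch of the
guard (the function name and signature are invented for illustration;
the real logic lives in pg_locale_libc.c and uses the locale-aware
_l() variants):

```c
#include <ctype.h>
#include <wctype.h>
#include <stdbool.h>

typedef unsigned int pg_wchar;  /* as in PostgreSQL's mb/pg_wchar.h */

/*
 * Hypothetical sketch: only trust libc's wchar_t classification when
 * the database encoding is UTF-8, where we assume wchar_t holds
 * Unicode code points.  Otherwise, only the ASCII range is safe.
 */
static bool
wc_isalpha_sketch(pg_wchar wc, bool db_is_utf8)
{
    if (wc <= 0x7F)
        return isalpha((unsigned char) wc) != 0;  /* ASCII fast path */
    if (db_is_utf8)
        return iswalpha((wint_t) wc) != 0;  /* wchar_t assumed ~ UTF-32 */
    return false;               /* non-UTF-8 multibyte: give up */
}
```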

(Aside: we do pass pg_wchars directly to ICU as UTF-32 codepoints,
regardless of encoding, which is a bug.)

> For locales that use UTF-8 for char, we expect libc to understand
> pg_wchar/wchar_t/wint_t values as UTF-32 or at a stretch UTF-16.  The
> expected source of these pg_wchar values is our various regexp code
> paths that will use our mbutils pg_wchar conversion to UTF-32, with a
> reasonable copying strategy for sizeof(wchar_t) == 2 (that's Windows
> and I think otherwise only AIX in 32 bit builds, if it comes back).
> If any libc didn't use Unicode codepoints in its locale-sensitive
> wchar_t functions for UTF-8 locales we'd get garbage results, but we
> don't know of any such system.

Check.

>   It's a bit of a shame that C11 didn't
> introduce the obvious isualpha(char32_t) variants for a
> standard-supported version of that realpolitik we depend on, but
> perhaps one day...

Yeah...

> There is one minor quirk here that it might be nice to document in
> top
> comment section 2: on Windows we also expect wchar_t to be understood
> by system wctype functions as UTF-16 for locales that *don't* use
> UTF-8 for char (an assumption that definitely doesn't hold on many
> Unixen).  That is important because on Windows we allow non-UTF-8
> locales to be used in UTF-8 databases for historical reasons.

Interesting.

> For single-byte encodings: pg_latin12wchar_with_len() just
> zero-extends the bytes to pg_wchar, so when the pg_locale_libc.c
> functions truncate them and call 8-bit ctype stuff eg isalpha_l(), it
> completes a perfect round trip inside our code.

So you're saying that pg_wchar is more like a union type?

typedef union pg_wchar
{
    char      ch;     /* single-byte encodings, or non-UTF8 encodings
                       * on unix */
    char16_t  utf16;  /* windows non-UTF8 encodings */
    char32_t  utf32;  /* UTF-8 encoding */
} pg_wchar;

(we'd have to be careful about the memory layout if we're casting,
though)

>   (BTW
> pg_latin12wchar_with_len() has the same definition as
> pg_ascii2wchar_with_len(), and is used for many single-byte encodings
> other than LATIN1 which makes me wonder why we don't just have a
> single function pg_char2wchar_with_len() that is used by all "simple
> widening" cases.)

Sounds like a nice simplification.
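
For what it's worth, the "simple widening" both functions do is just
zero-extending each byte; a merged version might look like the sketch
below (pg_char2wchar_with_len is the proposed name, not an existing
function; the body mirrors pg_ascii2wchar_with_len's stop-at-NUL,
terminate-and-count behavior):

```c
typedef unsigned int pg_wchar;  /* as in PostgreSQL's mb/pg_wchar.h */

/*
 * Sketch of a unified "simple widening" conversion: zero-extend each
 * byte to a pg_wchar, stopping at a NUL or after len bytes, and
 * NUL-terminate the output.  Returns the number of pg_wchars written.
 */
static int
pg_char2wchar_with_len_sketch(const unsigned char *from, pg_wchar *to,
                              int len)
{
    int     cnt = 0;

    while (len > 0 && *from)
    {
        *to++ = *from++;
        len--;
        cnt++;
    }
    *to = 0;
    return cnt;
}
```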

>   We never know or care which encoding libc would
> itself use for these locales' wchar_t, as we don't ever pass it a
> wchar_t.

Ah, that makes sense.

>   Assuming I understood that correctly, I think it would be
> nice if the "100% correct for LATINn" comment stated the reason for
> that certainty explicitly, ie that it closes an information-
> preserving
> round-trip beginning with the coercion in pg_latin12wchar_with_len()
> and that libc never receives a wchar_t/wint_t that we fabricated.

Agreed, though I think some refactoring would be helpful to accompany
the comment. I've worked with this stuff a lot and I still find it hard
to keep everything in mind at once.

> A bit of a digression, which I *think* is out-of-scope for this
> module, but just while I'm working through all the implications: 
> This
> could produce unspecified results if a wchar_t from another source
> ever arrived into these functions

Ugh.

When I first started dealing with pg_wchar, I assumed it was just a
wider wchar_t to abstract away some of the complexity when
sizeof(wchar_t) == 2 (e.g. get rid of surrogate pairs). It's clearly
more complicated than that.

> For multi-byte encodings other than UTF-8, pg_locale_libc.c is
> basically giving up almost completely

Right.

> I
> believe we can ignore MULE internal, as no libc supports it (so you
> could only get here with the C locale where you'll get the garbage
> results you asked for...  in fact I wonder why we need MULE internal at
> all... it seems to be a sort of double-encoding for multiplexing
> other
> encodings, so we can't exactly say it's not blessed by a standard,
> it's indirectly defined by "all the standards" in a sense, but it's
> also entirely obsoleted by Unicode's unification so I don't know what
> problem it solves for anyone, or if anyone ever needed it in any
> reasonable pg_upgrade window of history...).

I have never heard of someone using it in production, and I wouldn't
object if someone wants to deprecate it.

> 2.  More expensive but complete: handle ASCII range with existing
> 8-bit ctype functions, and otherwise convert our pg_wchar back to MB
> char format and then use libc's mbstowcs_l() to make a wchar_t that
> libc's wchar_t-based functions should understand.

Correct. Sounds painful, but perhaps we could just do it and measure
the performance.
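
As a rough illustration of that round trip, something like the sketch
below (mb_isalpha_roundtrip is an invented name; the real code would
convert back with pg_wchar2mb_with_len() and use mbstowcs_l() and
iswalpha_l() with the collation's locale rather than the defaults):

```c
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>
#include <wctype.h>

/*
 * Hypothetical sketch of "option 2": take the character in the
 * server's multibyte form, let libc decode it into its own wchar_t
 * with mbstowcs(), then classify with libc's wchar_t function.
 */
static bool
mb_isalpha_roundtrip(const char *mb, size_t len)
{
    char        buf[16];
    wchar_t     wc[4];

    if (len + 1 > sizeof(buf))
        return false;
    memcpy(buf, mb, len);
    buf[len] = '\0';
    if (mbstowcs(wc, buf, 4) != 1)  /* expect exactly one wide char */
        return false;
    return iswalpha((wint_t) wc[0]) != 0;
}
```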

>   To avoid doing hard
> work for nothing (ideogram-based languages generally don't care about
> ctype stuff so that'd be the vast majority of characters appearing in
> Chinese/Japanese/Korean text) at the cost of having to do a bunch of
> research, we could short-circuit the core CJK character ranges,
> and do the extra CPU cycles for the rest,

I don't think we should start making a bunch of assumptions like that.

> 3.  I assume there are some good reasons we don't do this but... if
> we
> used char2wchar() in the first place (= libc native wchar_t) for the
> regexp stuff that calls this stuff (as we do already inside
> whole-string upper/lower, just not character upper/lower or character
> classification), then we could simply call the wchar_t libc functions
> directly and unconditionally in the libc provider for all cases,
> instead of the 8-bit variants with broken edge cases for non-UTF-8
> databases.

I'm not sure about that either, but I think it's because with a 16-bit
wchar_t you can end up with surrogate pairs, whose halves aren't valid
code points on their own and can't be represented in UTF-8.

>   I didn't try to find the historical discussions, but I can
> imagine already that we might not have done that because it has to
> copy to cope with non-NULL-terminated strings,

That's probably another reason.

> and it would only be appropriate for libc locales anyway and
> yet now we have other locale providers that certainly don't want some
> unspecified wchar_t encoding or libc involved.

We could fix that by making some of these APIs take a char pointer
instead. That would allow libc to decode to wchar_t, and other
providers to decode to UTF-32. Or, we could say that pg_wchar is an
opaque type that can only be created by the provider, and passed back
to the same provider.
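
A rough sketch of what a char-pointer-based interface could look like
(all names here are invented for illustration; each provider would
supply its own decoder, libc via mbstowcs_l() and ICU via UTF-32):

```c
#include <stdbool.h>
#include <stddef.h>
#include <ctype.h>

/*
 * Hypothetical provider-neutral classification API: the caller passes
 * the character's bytes in the server encoding, and each provider
 * decodes them however it likes.
 */
typedef struct
{
    bool    (*isalpha_fn) (const char *ch, size_t len);
} char_provider;

/* A trivial "libc" provider that only handles single bytes. */
static bool
libc_isalpha_sketch(const char *ch, size_t len)
{
    return len == 1 && isalpha((unsigned char) ch[0]) != 0;
}

static const char_provider libc_provider = {libc_isalpha_sketch};
```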

>   It's also likely that
> non-UTF-8 systems are of dwindling interest to anyone outside perhaps
> client encodings

That's been my experience -- I haven't run into many non-UTF8 server
encodings.

> In passing, I wonder why _libc.c has that comment about ICU in
> parentheses.  Not relevant here.

I moved it in 4da12e9e2e.

>   I haven't thought much about whether
> it's relevant in the ICU provider code (it may come back to that
> do-we-accept-pg_wchar-we-didn't-make? question), but if it is then it
> also applies to Windows and probably glibc in the libc provider and I
> don't immediately see any problem (assuming no-we-don't! answer).

It's relevant for the regc_wc_isalpha(), etc. functions:

https://www.postgresql.org/message-id/e7b67d24288f811aebada7c33f9ae629dde0def5.camel@j-davis.com

Regards,
Jeff Davis
