Re: C11: should we use char32_t for unicode code points?

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Tatsuo Ishii <ishii(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: C11: should we use char32_t for unicode code points?
Date: 2025-10-29 01:00:54
Message-ID: CA+hUKG+hDkp1etcfy=taxJ8ybf8KapyOjqdBRPF7yaoSoSj1_w@mail.gmail.com
Lists: pgsql-hackers

On Wed, Oct 29, 2025 at 6:59 AM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> So you're saying that pg_wchar is more like a union type?
>
> typedef pg_wchar
> {
> char ch; /* single-byte encodings or
> non-UTF8 encodings on unix */
> char16_t utf16; /* windows non-UTF8 encodings */
> char32_t utf32; /* UTF-8 encoding */
> } pg_wchar;
>
> (we'd have to be careful about the memory layout if we're casting,
> though)

Interesting idea. I think it'd have to be something like:

typedef union
{
    unsigned char ch;         /* (1) single-byte encoding databases */
    char32_t utf32;           /* (2) UTF-8 databases */
    uint32_t ascii_or_custom; /* (3) MULE, EUC_XX databases */
} pg_wchar;

Dunno if it's worth actually doing, but it's a good illustration and a
better way to explain all this than the wall of text I wrote
yesterday. The collusion between common/wchar.c and pg_locale_libc.c
is made more explicit.
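The memory-layout caveat Jeff raised could at least be pinned down at compile time. A minimal sketch, assuming C11 (`_Static_assert` and `<uchar.h>` for char32_t); the union members are the ones from the sketch above:

```c
#include <stdint.h>
#include <uchar.h>   /* char32_t (C11) */

typedef union
{
    unsigned char ch;         /* single-byte encoding databases */
    char32_t utf32;           /* UTF-8 databases */
    uint32_t ascii_or_custom; /* MULE, EUC_XX databases */
} pg_wchar;

/* If callers cast between pg_wchar and uint32_t, the union must not
 * introduce padding or change the size. */
_Static_assert(sizeof(pg_wchar) == sizeof(uint32_t),
               "pg_wchar must be exactly 32 bits");
```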

I wonder if the logic to select the member/semantics could be turned
into an enum in the encoding table, to make it even clearer, and then
that could be used as an index into a table of ctype method objects
in _libc.c. The encoding module would be declaring which pg_wchar
semantics it uses, instead of having the _libc.c module infer it from
other properties, for a more explicit contract. Or, since they are
inferrable, perhaps a function in the mb module could do that and
return the enum. Hmm, perhaps that alone would be clarifying enough,
without the union type. I'm picturing something like PG_WCHAR_CHAR
(directly usable with ctype.h), PG_WCHAR_UTF32 (self-explanatory, also
assumed to be compatible with UTF-8 locales' wchar_t), and PG_WCHAR_CUSTOM
(we only know that the ASCII range is sane, as Ishii-san explained; for
anything else you'd need to re-encode via libc or give up, but
preferably not go nuts and return junk). The enum would create a new
central place to document the cross-module semantics.
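As a strawman, something like this (all names hypothetical, and the inference logic here is a stand-in for whatever properties the mb module would actually consult):

```c
#include <stdbool.h>

/* Hypothetical enum naming the pg_wchar semantics an encoding uses. */
typedef enum
{
    PG_WCHAR_CHAR,   /* single-byte encodings: directly usable with ctype.h */
    PG_WCHAR_UTF32,  /* UTF-8 databases: Unicode code points */
    PG_WCHAR_CUSTOM  /* MULE/EUC_XX: only the ASCII range is portable */
} pg_wchar_semantics;

/* Hypothetical function in the mb module that infers the semantics
 * from existing encoding properties and returns the enum. */
pg_wchar_semantics
pg_encoding_wchar_semantics(int max_encoding_length, bool is_utf8)
{
    if (max_encoding_length == 1)
        return PG_WCHAR_CHAR;
    if (is_utf8)
        return PG_WCHAR_UTF32;
    return PG_WCHAR_CUSTOM;
}
```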

You showed char16_t for Windows, but we don't ever get char16_t out of
wchar.c, it's always char32_t for UTF-8 input. It's just that _libc.c
truncates to UTF-16 or short-circuits to avoid overflow on that
platform (and in the past AIX 32-bit and maybe more), so it wouldn't
belong in a hypothetical union or enum.
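The short-circuit amounts to something like this sketch (the function name is hypothetical; the point is just that on a 16-bit-wchar_t platform, code points above the BMP are passed through rather than truncated into garbage):

```c
#include <stdint.h>
#include <wchar.h>
#include <wctype.h>

/* Sketch of the punt _libc.c effectively performs where wchar_t is
 * 16 bits (Windows, historically 32-bit AIX): code points above
 * 0xFFFF would need a surrogate pair, which single-wchar_t wctype
 * functions cannot handle, so return the input unchanged. */
uint32_t
toupper_utf32_via_libc(uint32_t cp)
{
    if (sizeof(wchar_t) == 2 && cp > 0xFFFF)
        return cp;              /* punt rather than truncate */
    return (uint32_t) towupper((wint_t) cp);
}
```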

> > To avoid doing hard
> > work for nothing (ideogram-based languages generally don't care about
> > ctype stuff so that'd be the vast majority of characters appearing in
> > Chinese/Japanese/Korean text) at the cost of having to do a bunch of
> > research, we could should short-circuit the core CJK character
> > ranges,
> > and do the extra CPU cycles for the rest,
>
> I don't think we should start making a bunch of assumptions like that.

Yeah, maybe not. Thought process: I had noticed that EUC was the only
relevant encoding family, and it has a character set selector, CS0 =
ASCII, and CS1, CS2, CS3 defined appropriately by the national
variants. I had noticed that at least the Japanese one can encode
Latin with accents, Greek etc (non-ASCII stuff that has a meaningful
isalpha() etc) and I took a wild guess that it might be easy to
distinguish them if they'd chosen to put those under a different CS
number. But I see now that they actually stuffed them all into CS1
along with kanji and kana, making it slightly more difficult: they're
still in different assigned "rows" though. At a guess, you can
probably identify extra punctuation (huh, that's surely relevant even
for pure Japanese text if we want ispunct to work?) and foreign
alphabets with some bitmasks. There might be something similar for
the other EUCs.
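For EUC-JP specifically, the row (ku) check is cheap, since a CS1 character's first byte minus 0xA0 gives the JIS X 0208 row, and the rows group punctuation, full-width Latin, Greek, and Cyrillic separately from kana and kanji. A sketch (helper names hypothetical, and the row set shown is illustrative rather than exhaustive):

```c
#include <stdbool.h>
#include <stdint.h>

/* In EUC-JP, a CS1 (JIS X 0208) character is two bytes in 0xA1..0xFE;
 * the first byte minus 0xA0 is the row (ku). */
int
eucjp_cs1_row(uint8_t first_byte)
{
    return first_byte - 0xA0;
}

/* Rows where isalpha() could plausibly be true for some cells:
 * row 3 = full-width alphanumerics, 6 = Greek, 7 = Cyrillic.
 * Rows 16 and up are kanji, which a fast path could skip. */
bool
eucjp_cs1_maybe_alpha(uint8_t first_byte)
{
    int row = eucjp_cs1_row(first_byte);

    return row == 3 || row == 6 || row == 7;
}
```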

It's true that it's really not nice to carry special knowledge like
that (it's not just "assumptions", it's a set of black and white
published standards), and we should probably try hard to avoid that.
Perhaps we could at least put the conversion in a new encoding table
function pointer "pg_wchar_custom_to_wchar_t", so we'd have a
reserved place to put that sort of optimisation (as opposed to making
_libc.c call char2wchar() with no hope of a fast path)... that is, if
we want to do any of this at all, and not just make new ctype functions
that return false for PG_WCHAR_CUSTOM values >= 128 and call it a
day...

If we do develop this idea though, one issue to contemplate is that
EUC code points might generate more than one wchar_t, looking at
EUC_JIS_2004[1]. We'd need a pg_wchar_custom_to_wchar_t() signature
that takes a single pg_wchar and writes to an output array and returns
the count, and then we'd have to decide what to do if we get more than
one. Surrogates are trivial under the existing "punt" doctrine:
Windows went big on Unicode before it grew beyond 16 bits, C doesn't
do wctype for multi-wchar_t sequences, and we can't fix any of that.
(rare?) combining character sequence then uhh... same problem one
level up, I think, even on Unix? I'm not sure if we could do much
better than the "punt" path in both cases: return either false or the
input character as appropriate.
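The signature described above might look like this, with a trivial ASCII-only implementation to show the contract (both names hypothetical):

```c
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Hypothetical per-encoding hook: convert one "custom"-semantics
 * pg_wchar to native wchar_t.  A single code point can expand to more
 * than one wchar_t (surrogate pairs on 16-bit platforms, multi-
 * character mappings in EUC_JIS_2004), so the hook writes into a
 * caller-provided array and returns the count, or 0 on failure. */
typedef size_t (*pg_wchar_custom_to_wchar_t)(uint32_t wc,
                                             wchar_t *out,
                                             size_t outlen);

/* Minimal implementation: only the ASCII range is known-sane. */
size_t
ascii_to_wchar_t(uint32_t wc, wchar_t *out, size_t outlen)
{
    if (wc >= 128 || outlen < 1)
        return 0;               /* punt outside the portable range */
    out[0] = (wchar_t) wc;
    return 1;
}
```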

> > 3. I assume there are some good reasons we don't do this but... if
> > we
> > used char2wchar() in the first place (= libc native wchar_t) for the
> > regexp stuff that calls this stuff (as we do already inside
> > whole-string upper/lower, just not character upper/lower or character
> > classification), then we could simply call the wchar_t libc functions
> > directly and unconditionally in the libc provider for all cases,
> > instead of the 8-bit variants with broken edge cases for non-UTF-8
> > databases.
>
> I'm not sure about that either, but I think it's because you can end up
> with surrogate pairs, which can't be represented in UTF-8.

Yeah, I think that alone is a good reason. We need PG_WCHAR_UTF32 (in
the sketch terminology above).

I wondered about PG_WCHAR_SYSTEM_WCHAR_T, that could potentially
replace PG_WCHAR_CUSTOM, in other words using system wchar_t but only
for EUC_*. The point of this would be for eg regexes to be able to
convert whole strings up-front with one libc call, rather than calling
for each character. The problem seems to be that you'd lose any
ability to deal with surrogates and combining characters as discussed
above, as you'd lose character synchronisation for want of a better
word. So I just can't see how to make this work. Which leads back to
the do-it-one-by-one idea, which then leads back to the
maybe-try-to-make-a-fast-path-for-kanji-etc idea 'cos otherwise it
sounds too expensive...

[1] https://en.wikipedia.org/wiki/JIS_X_0213
