Re: C11: should we use char32_t for unicode code points?

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Tatsuo Ishii <ishii(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: C11: should we use char32_t for unicode code points?
Date: 2025-10-25 03:21:28
Message-ID: CA+hUKGJ5Xh0KxLYXDZuPvw1_fHX=yuzb4xxtam1Cr6TPZZ1o+w@mail.gmail.com
Lists: pgsql-hackers

On Sat, Oct 25, 2025 at 4:25 AM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> On Fri, 2025-10-24 at 18:43 +0900, Tatsuo Ishii wrote:
> > Unless char32_t is solely used for the Unicode code point data, I
> > think it would be better to define something like "pg_unicode" and
> > use
> > it instead of directly using char32_t because it would be cleaner for
> > code readers.
>
> That was my original idea, but then I saw that apparently char32_t is
> intended for Unicode code points:
>
> https://www.gnu.org/software/gnulib/manual/html_node/The-char32_005ft-type.html

It's definitely a codepoint but C11 only promised UTF-32 encoding if
__STDC_UTF_32__ is defined to 1, and otherwise the encoding is
unknown. The C23 standard resolved that insanity and required UTF-32,
and there are no known systems[1] that didn't already conform, but I
guess you could static_assert(__STDC_UTF_32__, "char32_t must use
UTF-32 encoding"). It's also defined as at least, not exactly, 32
bits but we already require the machine to have uint32_t so it must be
exactly 32 bits for us, and we could static_assert(sizeof(char32_t) ==
4) for good measure. So all up, the standard type matches our
existing assumptions about pg_wchar *if* the database encoding is
UTF8.
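
Those two checks might look something like this (a sketch; assumes a
C11 compiler, and note the first line fails to compile outright on a
system that doesn't define __STDC_UTF_32__ at all):

```c
#include <assert.h>  /* static_assert */
#include <uchar.h>   /* char32_t */

/* Refuse to build on a hypothetical pre-C23 system where char32_t's
 * encoding is not UTF-32. */
static_assert(__STDC_UTF_32__, "char32_t must use UTF-32 encoding");

/* char32_t is only required to be *at least* 32 bits, but since we
 * already require uint32_t to exist, it must be exactly 32 bits. */
static_assert(sizeof(char32_t) == 4, "char32_t must be exactly 32 bits");
```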

IIUC you're proposing that all the stuff that only works when database
encoding is UTF8 should be flipped over to the new type, and that
seems like a really good idea to me: remaining uses of pg_wchar would
be warnings that the encoding is only conditionally known. It'd be
documentation without new type safety though: for example I think you
missed a spot, the return type of the definition of utf8_to_unicode()
(I didn't search exhaustively). Only in C++ is it a distinct type
that would catch that and a few other mistakes.
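
To illustrate the difference (a toy sketch, with pg_wchar reduced to
its underlying typedef): in C the two types coerce silently, so a
missed spot compiles cleanly, whereas C++ treats char32_t as a
distinct type and would diagnose mismatched declarations.

```c
#include <stdint.h>
#include <uchar.h>

typedef uint32_t pg_wchar;  /* simplified stand-in for PostgreSQL's type */

/* In C, char32_t is just a typedef for uint_least32_t, so this
 * conversion compiles with or without a cast -- the compiler offers
 * no type safety, only documentation for the reader. */
static char32_t
as_char32(pg_wchar wc)
{
    return wc;  /* silent coercion; (char32_t) wc would change nothing */
}

/* In C++, char32_t is a distinct builtin type, so a declaration
 * returning char32_t paired with a definition returning pg_wchar
 * would not match, catching the kind of missed spot described above. */
```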

Do you consider explicit casts between eg pg_wchar and char32_t to be
useful documentation for humans, when coercion should just work? I
kinda thought we were trying to cut down on useless casts: they might
signal something, but they can also hide bugs. Should the few places that
deal in surrogates be using char16_t instead?
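
For those spots, char16_t would make the UTF-16 assumption visible in
the types; for example, combining a surrogate pair (a generic sketch,
not code from the patch):

```c
#include <stdbool.h>
#include <uchar.h>  /* char16_t, char32_t */

/* True if c is a high (leading) surrogate, U+D800..U+DBFF. */
static bool
is_high_surrogate(char16_t c)
{
    return c >= 0xD800 && c <= 0xDBFF;
}

/* Combine a UTF-16 surrogate pair into a Unicode code point. */
static char32_t
surrogate_pair_to_code_point(char16_t hi, char16_t lo)
{
    return 0x10000 + ((((char32_t) hi - 0xD800) << 10) |
                      ((char32_t) lo - 0xDC00));
}
```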

I wonder if the XXX_libc_mb() functions that contain our hard-coded
assumptions that libc's wchar_t values are in UTF-16 or UTF-32 should
use your to_char32_t() too (probably with a longer name
pg_wchar_to_char32_t() if it's in a header for wider use). That'd
highlight the exact points at which we make that assumption and
centralise the assertion about database encoding, and then the code
that compares with various known cut-off values would be clearly in
the char32_t world.
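
A sketch of what that helper could look like (pg_wchar_to_char32_t is
the name floated above; the encoding check here is a hypothetical
stand-in for whatever assertion the real tree would use, e.g. a
GetDatabaseEncoding() == PG_UTF8 test):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <uchar.h>

typedef uint32_t pg_wchar;  /* simplified stand-in */

/* Hypothetical stand-in for a real database-encoding check. */
static bool
database_encoding_is_utf8(void)
{
    return true;
}

/*
 * Convert pg_wchar to char32_t, centralising the assumption that
 * pg_wchar values are Unicode code points only when the database
 * encoding is UTF8.
 */
static inline char32_t
pg_wchar_to_char32_t(pg_wchar wc)
{
    assert(database_encoding_is_utf8());
    return (char32_t) wc;
}
```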

> But I am also OK with a new type if others find it more readable.

Adding yet another name to this soup doesn't immediately sound like it
would make anything more readable to me. ISO has standardised this
for the industry, so I'd vote for adopting it without indirection that
makes the reader work harder to understand what it is. The churn
doesn't seem excessive either; it's fairly well contained stuff that
has already been moving around a lot in recent releases with your
ongoing revamping work.

There is one small practical problem though: Apple hasn't got around
to supplying <uchar.h> in its C SDK yet. It's there for C++ only, and
isn't needed for the type in C++ anyway. I don't think that alone
warrants a new name wart, as the standard tells us it must be the
same type as uint_least32_t, so we can just define it ourselves if
!defined(__cplusplus) && !defined(HAVE_UCHAR_H), until Apple gets
around to that.
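
That fallback might look like this (HAVE_UCHAR_H is a hypothetical
configure-probe macro; the typedef relies on the guarantee that
char32_t is the same type as uint_least32_t):

```c
#if defined(__cplusplus)
/* char32_t is a builtin type in C++, nothing needed */
#elif defined(HAVE_UCHAR_H)
#include <uchar.h>
#else
/* No <uchar.h> (currently macOS's C SDK): supply the typedef
 * ourselves, since the standard pins down exactly what it must be. */
#include <stdint.h>
typedef uint_least32_t char32_t;
#endif
```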

Since it confused me briefly: Apple does provide <unicode/uchar.h> but
that's a coincidentally named ICU header, and on that subject I see
that ICU hasn't adopted these types yet but there are some hints that
they're thinking about it; meanwhile their C++ interfaces have begun
to document that they are acceptable in a few template functions.

All other target systems have it AFAICS. Windows: tested by CI,
MinGW: found discussion, *BSD, Solaris, Illumos: found man pages.

As for the conversion functions in <uchar.h>, they're of course
missing on macOS but they also depend on the current locale, so it's
almost like C, POSIX and NetBSD have conspired to make them as useless
to us as possible. They solve the "size and encoding of wchar_t is
undefined" problem, but there are no _l() variants and we can't depend
on uselocale() being available. Probably wouldn't be much use to us
anyway considering our more complex and general transcoding
requirements; I just thought about this while contemplating
hypothetical pre-C23 systems that don't use UTF-32, specifically what
would break if such a system existed: probably nothing as long as you
don't use these. I guess another way you could tell would be if you
used the fancy new U-prefixed character/string literal syntax, but I
can't see much need for that.

In passing, we seem to have a couple of mentions of "pg_wchar_t"
(bogus _t) in existing comments.

[1] https://thephd.dev/c-the-improvements-june-september-virtual-c-meeting
