| From: | Jeff Davis <pgsql(at)j-davis(dot)com> |
|---|---|
| To: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
| Cc: | Tatsuo Ishii <ishii(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org |
| Subject: | Re: C11: should we use char32_t for unicode code points? |
| Date: | 2025-10-26 19:43:01 |
| Message-ID: | efd8b17d7f97bc7653dde565e274fef32f5da65c.camel@j-davis.com |
| Lists: | pgsql-hackers |
On Sat, 2025-10-25 at 16:21 +1300, Thomas Munro wrote:
> I guess you could static_assert(__STDC_UTF_32__, "char32_t must use
> UTF-32 encoding").
Done.
> It's also defined as at least, not exactly, 32 bits but we already
> require the machine to have uint32_t so it must be exactly 32 bits
> for us, and we could static_assert(sizeof(char32_t) == 4) for good
> measure.
What would be the problem if it were larger than 32 bits?
I don't mind adding the asserts, but it's slightly awkward because
StaticAssertDecl() isn't defined yet at the point we are including
uchar.h.
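For concreteness, the checks under discussion could look like this in plain C11, using the standard `static_assert` macro directly since `StaticAssertDecl()` is not yet available at that point (a sketch of the idea, not the patch's actual code):

```c
#include <assert.h>
#include <uchar.h>

/* Refuse to build unless char32_t values are UTF-32 code points */
#ifndef __STDC_UTF_32__
#error "char32_t must use UTF-32 encoding"
#endif

/*
 * char32_t is only required to be at least 32 bits, but since we already
 * require uint32_t it should be exactly 32 bits on supported platforms.
 */
static_assert(sizeof(char32_t) == 4, "char32_t must be exactly 32 bits");
```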
> IIUC you're proposing that all the stuff that only works when
> database encoding is UTF8 should be flipped over to the new type,
> and that seems like a really good idea to me: remaining uses of
> pg_wchar would be warnings that the encoding is only conditionally
> known.
Exactly. The idea is to make pg_wchar stand out more as a platform-dependent (or encoding-dependent) representation, and remove the doubt when someone sees char32_t.
> It'd be documentation without new type safety though: for example I
> think you missed a spot, the return type of the definition of
> utf8_to_unicode() (I didn't search exhaustively).
Right, it's not offering type safety. Fixed the omission.
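For readers following along: `utf8_to_unicode()` decodes one UTF-8 byte sequence into a code point, and the fix changes its return type from pg_wchar to char32_t. A simplified decoder of that shape (my own sketch, assuming well-formed input; it is not the patch's exact code):

```c
#include <uchar.h>

/*
 * Simplified sketch of utf8_to_unicode() with the patched signature:
 * decode one well-formed UTF-8 sequence and return the code point as
 * char32_t instead of pg_wchar.
 */
static char32_t
utf8_to_unicode(const unsigned char *c)
{
    if ((*c & 0x80) == 0)           /* 1-byte sequence: ASCII */
        return (char32_t) c[0];
    else if ((*c & 0xE0) == 0xC0)   /* 2-byte sequence */
        return (char32_t) (((c[0] & 0x1F) << 6) | (c[1] & 0x3F));
    else if ((*c & 0xF0) == 0xE0)   /* 3-byte sequence */
        return (char32_t) (((c[0] & 0x0F) << 12) |
                           ((c[1] & 0x3F) << 6) |
                           (c[2] & 0x3F));
    else                            /* 4-byte sequence */
        return (char32_t) (((c[0] & 0x07) << 18) |
                           ((c[1] & 0x3F) << 12) |
                           ((c[2] & 0x3F) << 6) |
                           (c[3] & 0x3F));
}
```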
> Do you consider explicit casts between eg pg_wchar and char32_t to be
> useful documentation for humans, when coercion should just work? I
> kinda thought we were trying to cut down on useless casts, they might
> signal something but can also hide bugs.
The patch doesn't add any explicit casts, except in to_char32() and
to_pg_wchar(), so I assume that the callsites of those functions are
what you meant by "explicit casts"?
We can get rid of those functions if you want. The main reason they
exist is for a place to comment on the safety of converting pg_wchar to
char32_t. I can put that somewhere else, though.
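For context, the helpers are essentially trivial casts whose value is the documented precondition. A plausible shape (my reconstruction, not the patch text; the `pg_wchar` typedef here is a stand-in):

```c
#include <uchar.h>

typedef unsigned int pg_wchar;  /* stand-in for PostgreSQL's typedef */

/*
 * Conversions between pg_wchar and char32_t. The casts are trivial; the
 * point of having functions is a single place to state the precondition:
 * pg_wchar only holds a Unicode code point when the database encoding is
 * UTF-8, so callers must ensure that before converting.
 */
static inline char32_t
to_char32(pg_wchar wc)
{
    return (char32_t) wc;
}

static inline pg_wchar
to_pg_wchar(char32_t c)
{
    return (pg_wchar) c;
}
```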
> Should the few places that
> deal in surrogates be using char16_t instead?
Yes, done.
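To illustrate why the type fits (my own example; the helper name is hypothetical): surrogate values are inherently 16-bit UTF-16 artifacts, so code that splits a supplementary-plane code point into a surrogate pair is a natural user of char16_t:

```c
#include <assert.h>
#include <uchar.h>

/*
 * Hypothetical helper: split a supplementary-plane code point
 * (U+10000..U+10FFFF) into a UTF-16 surrogate pair.
 */
static void
to_surrogate_pair(char32_t cp, char16_t *hi, char16_t *lo)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    cp -= 0x10000;
    *hi = (char16_t) (0xD800 | (cp >> 10));     /* high surrogate */
    *lo = (char16_t) (0xDC00 | (cp & 0x3FF));   /* low surrogate */
}
```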
> I wonder if the XXX_libc_mb() functions that contain our hard-coded
> assumptions that libc's wchar_t values are in UTF-16 or UTF-32 should
> use your to_char32_t() too (probably with a longer name
> pg_wchar_to_char32_t() if it's in a header for wider use).
I don't think those functions do depend on UTF-32. iswalpha(), etc.,
take a wint_t, which is just a wchar_t that can also be WEOF.
And if we don't use to_char32/to_pg_wchar in there, I don't see much
need for it outside of pg_locale_builtin.c, but if the need arises we
can move it to a header file and give it a longer name.
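A minimal illustration of the point about libc's interface (the helper name is mine): classification functions such as iswalpha() take a wint_t, i.e. a wchar_t value widened so it can also hold WEOF, and nothing at that call site assumes the value is UTF-32:

```c
#include <wctype.h>

/*
 * Hypothetical helper: libc classification functions take wint_t, which
 * is a wchar_t value that can also be WEOF. The value passed is whatever
 * the platform's wchar_t encoding produces; no UTF-32 assumption is made
 * at this call site.
 */
static int
wide_char_is_alpha(wint_t wc)
{
    return wc != WEOF && iswalpha(wc);
}
```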
> That'd highlight the exact points at which we make that assumption
> and centralise the assertion about database encoding, and then the
> code that compares with various known cut-off values would be
> clearly in the char32_t world.
The asserts about UTF-8 in pg_locale_libc.c are there because the
previous code only took those code paths for UTF-8, and I preserved
that. Also there is some code that depends on UTF-8 for decoding, but I
don't think anything in there depends on UTF-32 specifically.
> There is one small practical problem though: Apple hasn't got around
> to supplying <uchar.h> in its C SDK yet. It's there for C++ only,
> and isn't needed for the type in C++ anyway. I don't think that
> alone warrants a new name wart, as the standard tells us it must
> match uint_least32_t so we can just define it ourselves if
> !defined(__cplusplus) && !defined(HAVE_UCHAR_H), until Apple gets
> around to that.
Thank you, I added a configure test for uchar.h and some more
preprocessor logic in c.h.
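The resulting preprocessor logic might look roughly like this (a sketch of the idea, not the actual c.h change; `HAVE_UCHAR_H` stands for whatever the configure probe defines):

```c
#include <stdint.h>

#if defined(HAVE_UCHAR_H)
#include <uchar.h>
#elif !defined(__cplusplus)
/*
 * No <uchar.h> (e.g. Apple's C SDK): C11 requires char16_t/char32_t to
 * be the same types as uint_least16_t/uint_least32_t, so define them
 * ourselves. C++ has them as built-in types, so do nothing there.
 */
typedef uint_least16_t char16_t;
typedef uint_least32_t char32_t;
#endif
```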
> Since it confused me briefly: Apple does provide <unicode/uchar.h>
> but that's a coincidentally named ICU header, and on that subject I
> see that ICU hasn't adopted these types yet but there are some hints
> that they're thinking about it; meanwhile their C++ interfaces have
> begun to document that they are acceptable in a few template
> functions.
Even when they fully move to char32_t, we will still have to support
the older ICU versions for a long time.
> All other target systems have it AFAICS. Windows: tested by CI,
> MinGW: found discussion, *BSD, Solaris, Illumos: found man pages.
Great, thank you!
> They solve the "size and encoding of wchar_t is
> undefined" problem
One thing I never understood about this is that it's our code that
converts from the server encoding to pg_wchar (e.g.
pg_latin12wchar_with_len()), so we must understand the representation
of pg_wchar. And we cast directly from pg_wchar to wchar_t, so we
understand the encoding of wchar_t, too, right?
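For context, here is my simplified model of what a conversion like pg_latin12wchar_with_len() does (the helper name and `pg_wchar` typedef are stand-ins): Latin-1 byte values coincide with the first 256 Unicode code points, so the conversion is plain zero-extension, which is why the server-side pg_wchar representation is fully under our control:

```c
typedef unsigned int pg_wchar;  /* stand-in for PostgreSQL's typedef */

/*
 * Simplified model of pg_latin12wchar_with_len(): each Latin-1 byte is
 * zero-extended, since Latin-1 byte values equal the corresponding
 * Unicode code points. Returns the number of characters converted.
 */
static int
latin1_to_pg_wchar(const unsigned char *from, pg_wchar *to, int len)
{
    int cnt = 0;

    while (len > 0 && *from)
    {
        *to++ = *from++;        /* zero-extend the byte */
        len--;
        cnt++;
    }
    *to = 0;                    /* NUL-terminate, as the real code does */
    return cnt;
}
```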
> In passing, we seem to have a couple of mentions of "pg_wchar_t"
> (bogus _t) in existing comments.
Thank you. I'll fix that separately.
Regards,
Jeff Davis
| Attachment | Content-Type | Size |
|---|---|---|
| v2-0001-Use-C11-char16_t-and-char32_t-for-Unicode-code-po.patch | text/x-patch | 56.1 KB |