Re: C11: should we use char32_t for unicode code points?

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Peter Eisentraut <peter(at)eisentraut(dot)org>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: C11: should we use char32_t for unicode code points?
Date: 2025-10-28 20:03:33
Message-ID: CA+hUKGLWggvAW+ZK=P1ZoUBgS8EhodpA7ipeGuq2-3HePjjXDw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Oct 29, 2025 at 7:45 AM Peter Eisentraut <peter(at)eisentraut(dot)org> wrote:
> On 26.10.25 20:43, Jeff Davis wrote:
> > +/*
> > + * char16_t and char32_t
> > + * Unicode code points.
> > + */
> > +#ifndef __cplusplus
> > +#ifdef HAVE_UCHAR_H
> > +#include <uchar.h>
> > +#ifndef __STDC_UTF_16__
> > +#error "char16_t must use UTF-16 encoding"
> > +#endif
> > +#ifndef __STDC_UTF_32__
> > +#error "char32_t must use UTF-32 encoding"
> > +#endif
> > +#else
> > +typedef uint16_t char16_t;
> > +typedef uint32_t char32_t;
> > +#endif
> > +#endif
>
> This could be improved a bit. The reason for some of these conditionals
> is not clear. Like, what does __cplusplus have to do with this? I
> think it would be more correct to write a configure/meson check for the
> actual types rather than depend indirectly on a header check.

I suggested testing __cplusplus because I predicted that the typedef
would fail on a C++ compiler (since C++11), where char32_t is a
language keyword identifying a distinct type requiring no #include.
The fallback typedef itself is only needed because of an Apple-only
problem. Without that problem we could just include <uchar.h>
unconditionally, and presumably we eventually will, once Apple
supplies this non-optional-per-C11 header. On a Mac, #include
<uchar.h> fails for C (there is no $SDK/usr/include/uchar.h) but works
for C++ (it finds $SDK/usr/include/c++/v1/uchar.h). Since we'd probe
for HAVE_UCHAR_H with the C compiler, we'd not find it, and thus we'd
also need to exclude __cplusplus at compile time. Without that
exclusion, let's see what the error looks like with Clang:

test.cpp:2:22: error: cannot combine with previous 'int' declaration specifier
    2 | typedef unsigned int char32_t;
      |                      ^
test.cpp:2:1: warning: typedef requires a name [-Wmissing-declarations]
    2 | typedef unsigned int char32_t;
      | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 warning and 1 error generated.

GCC has a clearer message:

test.cpp:2:22: error: redeclaration of C++ built-in type 'char32_t' [-fpermissive]
    2 | typedef unsigned int char32_t;
      |                      ^~~~~~~~

If you try to test for the existence of the type rather than the
header in meson/configure, won't you still have the configure-with-C
compile-with-C++ problem, with no way to resolve it except by keeping
the test for __cplusplus that you're trying to get rid of? So what do
you gain other than more lines of configure stuff?
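
A minimal sketch of the configure-with-C, compile-with-C++ situation
described above. It assumes HAVE_UCHAR_H is the result of a probe run
with the C compiler and leaves it undefined, mimicking a platform whose
C SDK has no <uchar.h>. The enum and the which_char32() helper are
hypothetical names used only for illustration; they are not part of the
patch.

```c
#include <assert.h>             /* static_assert (C11) */
#include <limits.h>             /* CHAR_BIT */
#include <stdint.h>

/*
 * HAVE_UCHAR_H stands in for a configure/meson result obtained with the
 * C compiler.  It is deliberately left undefined in this sketch.
 */
#ifndef __cplusplus
#ifdef HAVE_UCHAR_H
#include <uchar.h>
#else
typedef uint16_t char16_t;
typedef uint32_t char32_t;
#endif
#endif

/* Whichever branch was taken, char32_t must hold any code point. */
static_assert(sizeof(char32_t) * CHAR_BIT >= 32, "char32_t is too narrow");

/* Hypothetical helper: report which branch supplied char32_t. */
enum char32_origin
{
    CHAR32_FROM_CXX_BUILTIN,
    CHAR32_FROM_UCHAR_H,
    CHAR32_FROM_FALLBACK_TYPEDEF
};

enum char32_origin
which_char32(void)
{
#ifdef __cplusplus
    /* C++11 and later: char32_t is a keyword, nothing was declared above. */
    return CHAR32_FROM_CXX_BUILTIN;
#elif defined(HAVE_UCHAR_H)
    return CHAR32_FROM_UCHAR_H;
#else
    return CHAR32_FROM_FALLBACK_TYPEDEF;
#endif
}
```

Built as C, this takes the fallback typedef. Built as C++, the
__cplusplus guard skips both branches, so the typedef that triggers the
errors above is never seen. A probe for the type instead of the header
would change what sets the inner macro, not the need for the outer
guard.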

Out of curiosity, I also tried -std=c++03 (an old C++ standard that
might not work for PostgreSQL for other reasons, but I wanted to see
what would happen with a standard from before char32_t became a
fundamental language type). I was surprised to see that the standard
library supplied char32_t anyway. It incorrectly(?) imports a type
name from future standards using an internal type, so our typedef
still fails, just with a different Clang error:

test.cpp:2:22: error: typedef redefinition with different types ('unsigned int' vs 'char32_t')
    2 | typedef unsigned int char32_t;
      |                      ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/c++/v1/__config:320:20: note: previous definition is here
  320 | typedef __char32_t char32_t;
      |                    ^

> The checks for __STDC_UTF_16__ and __STDC_UTF_32__ can be removed, as
> was discussed elsewhere, since we don't use any standard library
> functions that make use of these facts, and the need goes away with C23
> anyway.

+1
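
To illustrate why the encoding macros are not needed, here is a rough
sketch of code that treats char32_t purely as an integer holding a
code point. It calls no <uchar.h> conversion function such as
c32rtomb(), so nothing in it depends on __STDC_UTF_32__. The function
name sketch_utf8_encode() is hypothetical and is not PostgreSQL's
actual conversion routine; the typedef is a stand-in for the fallback
in the quoted patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifndef __cplusplus
typedef uint32_t char32_t;      /* stand-in for the patch's fallback */
#endif

/*
 * Hypothetical helper: encode one code point as UTF-8 into out[], which
 * must have room for 4 bytes.  Returns the number of bytes written, or 0
 * for surrogates and values above U+10FFFF.
 */
int
sketch_utf8_encode(char32_t cp, unsigned char *out)
{
    assert(out != NULL);

    if (cp <= 0x7F)
    {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp <= 0x7FF)
    {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF)
    {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;           /* surrogates are not valid code points */
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF)
    {
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                   /* beyond the Unicode range */
}
```

The arithmetic only needs an unsigned type of at least 32 bits, which
both <uchar.h> and the fallback typedef provide.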
