Re: C11: should we use char32_t for unicode code points?

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Tatsuo Ishii <ishii(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: C11: should we use char32_t for unicode code points?
Date: 2025-10-29 15:12:01
Message-ID: 024a6d53f246c87ee2796563f930a66fde1c0c0d.camel@j-davis.com
Lists: pgsql-hackers

On Wed, 2025-10-29 at 14:00 +1300, Thomas Munro wrote:
> I wonder if the logic to select the member/semantics could be turned
> into an enum in the encoding table, to make it even clearer, and then
> that could be used as an index into a table of ctype methods objects
> in _libc.c.

As long as we're able to isolate that logic in the libc provider,
that's reasonable. The other providers don't need that complexity, they
just need to decode straight to UTF-32.
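To make the enum-indexed table idea concrete, here is a minimal sketch.
Everything in it is an assumption for illustration: the enum, the
struct, and the function names are hypothetical and do not exist in the
tree, uint32_t stands in for the decoded value type, and the "custom"
classifier is only a placeholder.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <wctype.h>

/* Hypothetical: how a decoded value should be interpreted for an encoding. */
typedef enum WcharSemantics
{
	WCHAR_SEM_BYTE = 0,			/* single-byte encodings: value is a byte */
	WCHAR_SEM_UTF32,			/* UTF-8 input decoded to a code point */
	WCHAR_SEM_CUSTOM,			/* encoding-specific packing, e.g. EUC */
	WCHAR_SEM_COUNT
} WcharSemantics;

/* Hypothetical method table; a real one would have many more members. */
typedef struct CtypeMethods
{
	int			(*is_alpha) (uint32_t wc);
} CtypeMethods;

static int
byte_is_alpha(uint32_t wc)
{
	return wc <= 0xFF && isalpha((unsigned char) wc) != 0;
}

static int
utf32_is_alpha(uint32_t wc)
{
	return iswalpha((wint_t) wc) != 0;
}

/* Placeholder only: classify the ASCII range and reject everything else. */
static int
custom_is_alpha(uint32_t wc)
{
	return wc < 0x80 && isalpha((unsigned char) wc) != 0;
}

/* One method table per semantics value, indexed by the enum. */
static const CtypeMethods ctype_methods[WCHAR_SEM_COUNT] = {
	[WCHAR_SEM_BYTE] = {byte_is_alpha},
	[WCHAR_SEM_UTF32] = {utf32_is_alpha},
	[WCHAR_SEM_CUSTOM] = {custom_is_alpha},
};

/* The selection logic stays in one place, inside the libc provider. */
static const CtypeMethods *
ctype_methods_for(WcharSemantics sem)
{
	assert(sem >= 0 && sem < WCHAR_SEM_COUNT);
	return &ctype_methods[sem];
}
```

The encoding table would carry one WcharSemantics value per encoding,
and only the libc provider would call ctype_methods_for(); the other
providers would never see the enum.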

> You showed char16_t for Windows, but we don't ever get char16_t out of
> wchar.c, it's always char32_t for UTF-8 input.  It's just that _libc.c
> truncates to UTF-16 or short-circuits to avoid overflow on that
> platform (and in the past AIX 32-bit and maybe more), so it wouldn't
> belong in a hypothetical union or enum.

Oh, I see.

> Perhaps we could at least put the conversion in a new encoding table
> function pointer "pg_wchar_custom_to_wchar_t", so we could reserve a
> place to put that sort of optimisation in

That sounds like a good step forward. And maybe one to convert to
UTF-32 for ICU, also?

> If we do develop this idea though, one issue to contemplate is that
> EUC code points might generate more than one wchar_t, looking at
> EUC_JIS_2004[1].

Wow, that's unfortunate.
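
If one input value can expand to several wchar_t, the function pointer
would probably need to write into a caller-supplied buffer and return a
count rather than return a single value. A rough sketch under that
assumption follows; the signature, the struct, and the input value that
triggers the two-unit case are all made up for illustration and are not
taken from any real mapping table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/*
 * Hypothetical signature: convert one decoded value into wchar_t units.
 * Returns the number of units written, or 0 if dstlen is too small.
 */
typedef size_t (*custom_to_wchar_t_fn) (uint32_t in, wchar_t *dst,
										size_t dstlen);

/* Made-up input value that expands to two output units in this toy. */
#define TOY_PAIR_VALUE 0xA4F7u

static size_t
toy_custom_to_wchar_t(uint32_t in, wchar_t *dst, size_t dstlen)
{
	assert(dst != NULL);

	if (in == TOY_PAIR_VALUE)
	{
		/* Base character plus combining mark: two units out. */
		if (dstlen < 2)
			return 0;
		dst[0] = (wchar_t) 0x304B;
		dst[1] = (wchar_t) 0x309A;
		return 2;
	}

	/* Toy fallback: pass the value through as a single unit. */
	if (dstlen < 1)
		return 0;
	dst[0] = (wchar_t) in;
	return 1;
}

/* Hypothetical encoding table entry with the reserved slot. */
typedef struct ToyEncodingEntry
{
	const char *name;
	custom_to_wchar_t_fn custom_to_wchar_t; /* NULL when not needed */
} ToyEncodingEntry;

static const ToyEncodingEntry toy_entry = {
	"TOY_EUC", toy_custom_to_wchar_t
};
```

A caller in the libc provider would then have to loop over the returned
units, which is exactly the extra complexity the other providers avoid
by decoding straight to UTF-32.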

Regards,
Jeff Davis
