| From: | Jeff Davis <pgsql(at)j-davis(dot)com> |
|---|---|
| To: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
| Cc: | Tatsuo Ishii <ishii(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org |
| Subject: | Re: C11: should we use char32_t for unicode code points? |
| Date: | 2025-10-29 15:12:01 |
| Message-ID: | 024a6d53f246c87ee2796563f930a66fde1c0c0d.camel@j-davis.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Wed, 2025-10-29 at 14:00 +1300, Thomas Munro wrote:
> I wonder if the logic to select the member/semantics could be turned
> into an enum in the encoding table, to make it even clearer, and then
> that could be used as an index into a table of ctype methods obejcts
> in _libc.c.
As long as we're able to isolate that logic in the libc provider,
that's reasonable. The other providers don't need that complexity, they
just need to decode straight to UTF-32.
> You showed char16_t for Windows, but we don't ever get char16_t out
> of
> wchar.c, it's always char32_t for UTF-8 input. It's just that
> _libc.c
> truncates to UTF-16 or short-circuits to avoid overflow on that
> platform (and in the past AIX 32-bit and maybe more), so it wouldn't
> belong in a hypothetical union or enum.
Oh, I see.
> >
> Perhaps we could at least put the conversion in a new encoding table
> function pointer "pg_wchar_custom_to_wchar_t", so we could reserve a
> place to put that sort of optimisation in
That sounds like a good step forward. And maybe one to convert to UTF-
32 for ICU, also?
> If we do develop this idea though, one issue to contemplate is that
> EUC code points might generate more than one wchar_t, looking at
> EUC_JIS_2004[1].
Wow, that's unfortunate.
Regards,
Jeff Davis
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Sami Imseih | 2025-10-29 15:24:17 | Re: another autovacuum scheduling thread |
| Previous Message | Ashutosh Bapat | 2025-10-29 14:55:16 | Re: Report bytes and transactions actually sent downtream |