Re: C11: should we use char32_t for unicode code points?

From:	Tatsuo Ishii <ishii(at)postgresql(dot)org>
To:	thomas(dot)munro(at)gmail(dot)com
Cc:	pgsql(at)j-davis(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: C11: should we use char32_t for unicode code points?
Date:	2025-10-28 08:36:13
Message-ID:	20251028.173613.18179479132562731.ishii@postgresql.org
Lists:	pgsql-hackers

> The EUC family has direct encoding of 7-bit ASCII and then 3
> selectable character sets represented by sequences with the high bit
> set, with details varying between the Chinese (simplified Chinese),
> Taiwanese (traditional Chinese), Japanese (2 kinds) and Korean
> variants. I don't know if the pg_wchar encoding we're producing in
> pg_euc*2wchar_with_len() has a name, but it doesn't appear to match
> the description of the standard "fixed" representation on the
> Wikipedia page for Extended Unix Code (it's too wide for starters,
> looking at the shift distances).

Yes. pg_euc*2wchar_with_len() works on the "variable length"
representation of EUC, in which each character takes 1 to 4 bytes, and
expands each character into a pg_wchar. The result can also be
converted back to the multibyte representation easily.
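
To make the expansion concrete, here is a minimal sketch of the idea for EUC-JP. It is not the actual PostgreSQL code: the names euc_jp_char_len(), euc_pack() and euc_unpack() are hypothetical, and the sketch assumes well-formed input. The bytes of one character are packed big-endian into a 32-bit value, and the byte sequence is recovered from the magnitude of that value.

```c
#include <stddef.h>
#include <stdint.h>

#define SS2 0x8e				/* single shift 2: selects code set 2 */
#define SS3 0x8f				/* single shift 3: selects code set 3 */

/* Hypothetical helper: byte length of one EUC-JP character, from its lead byte. */
static int
euc_jp_char_len(unsigned char lead)
{
	if (lead == SS2)
		return 2;				/* SS2 + 1 byte */
	if (lead == SS3)
		return 3;				/* SS3 + 2 bytes */
	if (lead & 0x80)
		return 2;				/* code set 1: 2 high-bit bytes */
	return 1;					/* 7-bit ASCII */
}

/* Pack the len bytes of one character into a 32-bit wide character. */
static uint32_t
euc_pack(const unsigned char *mb, int len)
{
	uint32_t	w = 0;

	for (int i = 0; i < len; i++)
		w = (w << 8) | mb[i];
	return w;
}

/* Unpack a wide character into out (at least 4 bytes); returns the byte count. */
static int
euc_unpack(uint32_t w, unsigned char *out)
{
	int			len = 1;

	if (w > 0xffffff)
		len = 4;
	else if (w > 0xffff)
		len = 3;
	else if (w > 0xff)
		len = 2;

	for (int i = len - 1; i >= 0; i--)
	{
		out[i] = (unsigned char) (w & 0xff);
		w >>= 8;
	}
	return len;
}
```

Because the multibyte bytes are kept verbatim inside the wide value, the round trip needs no lookup table, which is why the conversion back is easy.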

Note that the standard "fixed" representation of EUC uses bytes in the
ASCII range inside *non*-ASCII characters, so I think it is not easy to
use as a backend-safe encoding.
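
A small sketch of why that matters, under stated assumptions: naive_find_ascii() is a hypothetical helper, and the byte values used with it are illustrative, chosen to follow the description of the two-byte fixed format, where a code set 3 character has the high bit set on its first byte only. A byte-wise search for an ASCII character such as backslash (0x5c) can then match in the middle of a non-ASCII character, whereas in the variable-length form every byte of a non-ASCII character has the high bit set.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: index of the first byte equal to ascii_ch, or -1.
 * This is the kind of byte-wise scan that is only safe when ASCII-range
 * bytes never occur inside multibyte characters.
 */
static long
naive_find_ascii(const unsigned char *s, size_t len, unsigned char ascii_ch)
{
	const unsigned char *p = memchr(s, ascii_ch, len);

	return p ? (long) (p - s) : -1;
}
```

For example, scanning the illustrative fixed-form bytes {0xb0, 0x5c} for 0x5c reports a match at index 1 even though no backslash character is present, while the variable-length bytes {0x8f, 0xb0, 0xdc} give no match.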

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

In response to

Re: C11: should we use char32_t for unicode code points? at 2025-10-28 02:40:16 from Thomas Munro