| From: | Tatsuo Ishii <ishii(at)postgresql(dot)org> |
|---|---|
| To: | thomas(dot)munro(at)gmail(dot)com |
| Cc: | pgsql(at)j-davis(dot)com, pgsql-hackers(at)postgresql(dot)org |
| Subject: | Re: C11: should we use char32_t for unicode code points? |
| Date: | 2025-10-28 08:36:13 |
| Message-ID: | 20251028.173613.18179479132562731.ishii@postgresql.org |
| Lists: | pgsql-hackers |
> The EUC family has direct encoding of 7-bit ASCII and then 3
> selectable character sets represented by sequences with the high bit
> set, with details varying between the Chinese (simplified Chinese),
> Taiwanese (traditional Chinese), Japanese (2 kinds) and Korean
> variants. I don't know if the pg_wchar encoding we're producing in
> pg_euc*2wchar_with_len() has a name, but it doesn't appear to match
> the description of the standard "fixed" representation on the
> Wikipedia page for Extended Unix Code (it's too wide for starters,
> looking at the shift distances).
Yes. pg_euc*2wchar_with_len() works on the "variable length" representation
of EUC, in which each character takes 1 to 4 bytes, and expands each
character into a pg_wchar. The result can also be converted back to the
multibyte representation easily.
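To make the packing idea concrete, here is a minimal sketch for the EUC-JP case. It is a simplified illustration, not the actual PostgreSQL code: the helper names euc_jp_char_len(), pack_char() and unpack_char() are hypothetical, and it assumes the input is valid and not truncated.

```c
#include <assert.h>
#include <stdint.h>

#define SS2 0x8e				/* single shift 2: SS2 + 1 byte */
#define SS3 0x8f				/* single shift 3: SS3 + 2 bytes */

/* Hypothetical helper: byte length of the EUC-JP character starting at s. */
static int
euc_jp_char_len(const unsigned char *s)
{
	if (*s == SS3)
		return 3;				/* SS3 followed by 2 bytes */
	if (*s & 0x80)
		return 2;				/* SS2 + 1 byte, or a plain 2-byte character */
	return 1;					/* 7-bit ASCII */
}

/* Pack the 1 to 4 bytes of one character into 32 bits, first byte highest. */
static uint32_t
pack_char(const unsigned char *s, int len)
{
	uint32_t	w = 0;

	assert(len >= 1 && len <= 4);
	for (int i = 0; i < len; i++)
		w = (w << 8) | s[i];
	return w;
}

/*
 * Recover the original bytes from a packed value. This relies on the first
 * byte of every multibyte character being nonzero, so the byte length can be
 * read off the highest nonzero byte. "out" needs room for 4 bytes.
 */
static int
unpack_char(uint32_t w, unsigned char *out)
{
	int			n = 0;

	if (w & 0xff000000)
		out[n++] = (w >> 24) & 0xff;
	if (w & 0xffff0000)
		out[n++] = (w >> 16) & 0xff;
	if (w & 0xffffff00)
		out[n++] = (w >> 8) & 0xff;
	out[n++] = w & 0xff;
	return n;
}
```

For example, the two bytes 0x8e 0xb1 pack to 0x8eb1, and unpacking 0x8fb0a1 gives back the three bytes 0x8f 0xb0 0xa1, which is why the round trip is cheap.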
Note that the standard "fixed" representation of EUC includes ASCII-range
bytes in *non*-ASCII characters, so I think it is not easy to use as a
backend-safe encoding.
Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp