Re: GB18030-2022 Support in PostgreSQL

From: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>
To: John Naylor <johncnaylorls(at)gmail(dot)com>
Cc: Peter Eisentraut <peter(at)eisentraut(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject: Re: GB18030-2022 Support in PostgreSQL
Date: 2025-09-29 08:19:48
Message-ID: CAEoWx2=BWDFXpB9OhfoKJGsU-Lk+7oQ8SW7a5GyoufLiFTWO8g@mail.gmail.com
Lists: pgsql-hackers

On Mon, Sep 29, 2025 at 12:03 PM John Naylor <johncnaylorls(at)gmail(dot)com>
wrote:

> On Wed, Sep 24, 2025 at 4:18 PM Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> wrote:
> > I am not sure if you should also upgrade the UCM file to 2022 version,
> > but if we need, let’s do it with a separate commit.
>
> If they can all use the same file, we should just do that for the sake
> of simplicity, in which case a separate commit is just extra noise.
>
>
In v3, I have updated EUC_CN to use gb18030-2022.ucm as well. Fortunately,
the generated map files are unchanged, so EUC_CN does not need much
additional testing.

For UHC, the ICU main branch
(https://github.com/unicode-org/icu/tree/main/icu4c/source/data/mappings)
still contains windows-949-2000.ucm, so only the download URL changes; the
file content is unchanged.

```
% make utf8_to_uhc.map utf8_to_euc_cn.map
wget -O windows-949-2000.ucm --no-use-server-timestamps https://raw.githubusercontent.com/unicode-org/icu/refs/heads/main/icu4c/source/data/mappings/windows-949-2000.ucm
--2025-09-29 16:00:40--  https://raw.githubusercontent.com/unicode-org/icu/refs/heads/main/icu4c/source/data/mappings/windows-949-2000.ucm
HTTP request sent, awaiting response... 200 OK
Length: 356253 (348K) [text/plain]
Saving to: ‘windows-949-2000.ucm’

windows-949-2000.ucm 100%[=========================================================================================================>] 347.90K 222KB/s in 1.6s

2025-09-29 16:00:43 (222 KB/s) - ‘windows-949-2000.ucm’ saved [356253/356253]

'/usr/bin/perl' -I . UCS_to_UHC.pl
- Writing UTF8=>UHC conversion table: utf8_to_uhc.map
- Writing UHC=>UTF8 conversion table: uhc_to_utf8.map
wget -O gb18030-2022.ucm --no-use-server-timestamps https://raw.githubusercontent.com/unicode-org/icu/refs/heads/main/icu4c/source/data/mappings/gb18030-2022.ucm
--2025-09-29 16:00:43--  https://raw.githubusercontent.com/unicode-org/icu/refs/heads/main/icu4c/source/data/mappings/gb18030-2022.ucm
HTTP request sent, awaiting response... 200 OK
Length: 675312 (659K) [text/plain]
Saving to: ‘gb18030-2022.ucm’

gb18030-2022.ucm 100%[=========================================================================================================>] 659.48K 1.33MB/s in 0.5s

2025-09-29 16:00:44 (1.33 MB/s) - ‘gb18030-2022.ucm’ saved [675312/675312]

'/usr/bin/perl' -I . UCS_to_EUC_CN.pl
- Writing UTF8=>EUC_CN conversion table: utf8_to_euc_cn.map
- Writing EUC_CN=>UTF8 conversion table: euc_cn_to_utf8.map
% git diff
%
```
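
The empty `git diff` above is what shows that the regenerated maps match the committed ones. For anyone who wants to cross-check the same thing outside git, here is a minimal sketch that snapshots file digests before and after running make. The helper names (`file_digest`, `snapshot`, `changed_files`) are hypothetical and not part of the PostgreSQL tree; only the map file names come from the output above.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def snapshot(paths):
    """Map each file name to its digest; a missing file maps to None."""
    return {
        str(p): (file_digest(p) if Path(p).exists() else None)
        for p in paths
    }


def changed_files(before, after):
    """Return the sorted names whose digest differs between two snapshots."""
    names = before.keys() | after.keys()
    return sorted(n for n in names if before.get(n) != after.get(n))


# Map files regenerated by the make targets shown in the transcript above.
MAP_FILES = [
    "utf8_to_uhc.map",
    "uhc_to_utf8.map",
    "utf8_to_euc_cn.map",
    "euc_cn_to_utf8.map",
]
```

The intended use is to call `snapshot(MAP_FILES)` in the directory holding the map files before running make, call it again afterwards, and pass both results to `changed_files`; an empty list corresponds to the empty `git diff` above.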

Please note that I didn't include the deletion of gb-18030-2000.xml in v3,
because that would make the patch file too large, and the email would then
need to go through an approval process before landing in the mail archive.
Please delete the XML file when you push the commit.

Best regards,
Chao Li (Evan)
---------------------
HighGo Software Co., Ltd.
https://www.highgo.com/

Attachment: v3-0001-Generate-EUC_CN-and-UHC-mappings-from-the-Unicode.patch (application/octet-stream, 5.8 KB)
