Re: GB18030-2022 Support in PostgreSQL

From: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>
To: John Naylor <johncnaylorls(at)gmail(dot)com>
Cc: Peter Eisentraut <peter(at)eisentraut(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject: Re: GB18030-2022 Support in PostgreSQL
Date: 2025-09-10 11:54:08
Message-ID: CAEoWx2=EoJYa8HRXaWOnoaD58MonkwN8pTKkc_7=oj6Zjs0P=w@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi John,

Thank you very much for taking care of this patch.

John Naylor <johncnaylorls(at)gmail(dot)com> 于2025年9月10日周三 14:38写道:

>
> - The URL at the top currently points to a directory in Github, but v3
> changed it to point to the actual file. A directory can be navigated
> for inspection, so I used:
>
> 2000:
> https://github.com/unicode-org/icu-data/tree/main/charset/data/ucm
>
> 2022:
> https://github.com/unicode-org/icu/blob/main/icu4c/source/data/mappings/
>
>
Looks good.

> - I also made the regex a multiline regex for readability, even though
> the previous one was not.
>
>
Thank you very much for polishing the perl script. I am not an expert of
perl. I can make the script working, but not perfect.

> For 2022 version, I think it would be good to once run a test to
> verify that no mappings changed that we didn't expect. Perhaps the
> tests here can be used:
>
>
> https://www.postgresql.org/message-id/b9e3167f-f84b-7aa4-5738-be578a4db924%40iki.fi
>
>
I have manually run tested I had done before, everything works as expected.

I downloaded the tests from the referenced mail, but I cannot make the
tests to run. After extracting the 2 patch files, it added
src/test/encodings, but "make check" seems to not run them. I tried to copy
.out and .sql files to src/test/regress, but the tests still not running.
Did I miss anything?

The upstream correction to the 2000 version is not present in our
> mappings, so we should mention that, unless it was reverted in or
> before 2022.
>

I think the upstream correction to the 2000 version is just a few not
round-trip chars that are ignored by us. So I feel we don't need to mention
them.

>
> In the documentation (charset.sgml), do we want to mention the version
> e.g. the following?
>
> <entry><literal>GB18030</literal></entry>
> -<entry>National Standard</entry>
> +<entry>National Standard, version 2022</entry>
>

That's a good idea. I updated the sgml file:

[image: image.png]

>
> I've whacked around the commit messages, so those should be reviewed
> for accuracy.
>
> Your draft commit message had "9 characters are no longer required by
> the new standard, but are retained in this patch for compatibility"
> ...but those nine were introduced in the 2005 version, right? In which
> case it doesn't affect us. Please confirm.
>

I don't find any hint about if the 9 characters were introduced in the 2005
version.

But without this patch, they can be properly converted:
```
evantest=# SELECT encode(convert_from(decode('FD9D', 'hex'),
'GB18030')::bytea, 'hex');
encode
--------
efa5b9
(1 row)
```
So they should be available in the version 2002 already.

>
> "Author: Zheng Tao <taoz(at)highgo(dot)com>" -- I haven't seen any messages
> from this address in this thread, so could you confirm this was
> intentional?
>
>
Yes, Zheng Tao is my colleague. He worked with me for this patch, so I want
to credit him.

I am attaching v5 version. The only change is 0003, I added the SGML change.

Best regards,
Chao Li (Evan)
---------------------
HighGo Software Co., Ltd.
https://www.highgo.com/

Attachment Content-Type Size
v5-0002-JCN-changes.patch application/octet-stream 2.4 KB
v5-0001-Generate-GB18030-mappings-from-the-Unicode-Consor.patch application/octet-stream 4.9 KB
v5-0003-Update-GB18030-encoding-from-version-2000-to-2022.patch application/octet-stream 456.6 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrey Borodin 2025-09-10 11:59:56 Re: VM corruption on standby
Previous Message Alexander Kukushkin 2025-09-10 11:53:41 Re: issue with synchronized_standby_slots