From: | Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> |
---|---|
To: | John Naylor <johncnaylorls(at)gmail(dot)com> |
Cc: | pgsql-hackers(at)lists(dot)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net> |
Subject: | Re: GB18030-2022 Support in PostgreSQL |
Date: | 2025-08-11 08:22:09 |
Message-ID: | 2F92C344-A707-44D0-A718-670E30B2C1DF@gmail.com |
Lists: | pgsql-hackers |
Hi John,
Thanks for your review.
Yes, I did a diff between 2000.ucm and 2022.ucm when I worked on the patch. The differences between the two files are quite small:
```diff - omit the comment part
> <U20AC> \x80 |3
> <U3000> \xA3\xA0 |3
> <UE5E5> \xA3\xA0 |4
>
28067a28099,28114
> <U9FB4> \xFE\x59 |0
> <U9FB4> \x82\x35\x90\x37 |3
> <U9FB5> \xFE\x61 |0
> <U9FB5> \x82\x35\x90\x38 |3
> <U9FB6> \xFE\x66 |0
> <U9FB6> \x82\x35\x90\x39 |3
> <U9FB7> \xFE\x67 |0
> <U9FB7> \x82\x35\x91\x30 |3
> <U9FB8> \xFE\x6D |0
> <U9FB8> \x82\x35\x91\x31 |3
> <U9FB9> \xFE\x7E |0
> <U9FB9> \x82\x35\x91\x32 |3
> <U9FBA> \xFE\x90 |0
> <U9FBA> \x82\x35\x91\x33 |3
> <U9FBB> \xFE\xA0 |0
> <U9FBB> \x82\x35\x91\x34 |3
29577c29624
< <UE5E5> \xA3\xA0 |0
---
> # <UE5E5> \xA3\xA0 |0
30001,30010c30048,30057
< <UE78D> \xA6\xD9 |0
< <UE78E> \xA6\xDA |0
< <UE78F> \xA6\xDB |0
< <UE790> \xA6\xDC |0
< <UE791> \xA6\xDD |0
< <UE792> \xA6\xDE |0
< <UE793> \xA6\xDF |0
< <UE794> \xA6\xEC |0
< <UE795> \xA6\xED |0
< <UE796> \xA6\xF3 |0
---
> <UE78D> \xA6\xD9 |1
> <UE78E> \xA6\xDA |1
> <UE78F> \xA6\xDB |1
> <UE790> \xA6\xDC |1
> <UE791> \xA6\xDD |1
> <UE792> \xA6\xDE |1
> <UE793> \xA6\xDF |1
> <UE794> \xA6\xEC |1
> <UE795> \xA6\xED |1
> <UE796> \xA6\xF3 |1
30146c30193
< <UE81E> \xFE\x59 |0
---
> <UE81E> \xFE\x59 |1
30154c30201
< <UE826> \xFE\x61 |0
---
> <UE826> \xFE\x61 |1
30159,30160c30206,30207
< <UE82B> \xFE\x66 |0
< <UE82C> \xFE\x67 |0
---
> <UE82B> \xFE\x66 |1
> <UE82C> \xFE\x67 |1
30166c30213
< <UE832> \xFE\x6D |0
---
> <UE832> \xFE\x6D |1
30183c30230
< <UE843> \xFE\x7E |0
---
> <UE843> \xFE\x7E |1
30200c30247
< <UE854> \xFE\x90 |0
---
> <UE854> \xFE\x90 |1
30216c30263
< <UE864> \xFE\xA0 |0
---
> <UE864> \xFE\xA0 |1
30470a30518,30537
> <UFE10> \xA6\xD9 |0
> <UFE10> \x84\x31\x82\x36 |3
> <UFE11> \xA6\xDB |0
> <UFE11> \x84\x31\x82\x37 |3
> <UFE12> \xA6\xDA |0
> <UFE12> \x84\x31\x82\x38 |3
> <UFE13> \xA6\xDC |0
> <UFE13> \x84\x31\x82\x39 |3
> <UFE14> \xA6\xDD |0
> <UFE14> \x84\x31\x83\x30 |3
> <UFE15> \xA6\xDE |0
> <UFE15> \x84\x31\x83\x31 |3
> <UFE16> \xA6\xDF |0
> <UFE16> \x84\x31\x83\x32 |3
> <UFE17> \xA6\xEC |0
> <UFE17> \x84\x31\x83\x33 |3
> <UFE18> \xA6\xED |0
> <UFE18> \x84\x31\x83\x34 |3
> <UFE19> \xA6\xF3 |0
> <UFE19> \x84\x31\x83\x35 |3
```
As you can see, the changes are limited to the 18 changed characters plus 3 other Unicode code points (U+20AC, U+3000, U+E5E5). My code comment in UCS_to_GB18030.pl explains these changes:
```code comment from UCS_to_GB18030.pl
# The |n is a flag, where n has values of 0, 1, 3, 4.
# With reference to https://ken-lunde.medium.com/the-gb-18030-2022-standard-3d0ebaeb4132,
# the flag should mean the following:
# 0 - round-trip mapping
# 1 - there are 18 mappings with flag 1, those are mapping changes
# from GB18030-2000 to GB18030-2022. Old mappings are marked
# with flag 1, new mappings with flag 0. So we can ignore all
# mappings with flag 1.
# 3 - there are 20 mappings with flag 3:
# 18 of them correspond to the 18 mappings with flag 1, but give
# the new GB18030-2022 mapping for the old mapping's Unicode code point.
# These 18 new mappings have no actual glyphs in GB18030-2022.
# So we can ignore these 18 mappings with flag 3.
# The other 2 are: "<U20AC> \x80 |3" and "<U3000> \xA3\xA0 |3".
# They are two reserved fallbacks for compatibility with GBK and
# other web data as in WHATWG. Both U20AC and U3000 have round-
# trip mappings in GB18030-2022, so we can ignore these two
# mappings with flag 3.
# So, we can ignore all mappings with flag 3.
# 4 - there is only one mapping with flag 4: <UE5E5> \xA3\xA0 |4.
# This is a "good one-way" mapping from U+E5E5 to \xA3\xA0
# for maximum compatibility with previous behavior. So we can
# ignore this mapping as well.
```
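To make the flag handling concrete, here is a minimal Python sketch of the filtering rule described in the comment: read mapping lines and keep only those with flag 0. It is not the code in UCS_to_GB18030.pl (which is Perl); the names `parse_ucm_line` and `round_trip_mappings` are made up for illustration, and the sample lines are copied from the diff above.

```python
import re

# Matches a mapping line such as "<U9FB4> \xFE\x59 |0".
UCM_LINE = re.compile(
    r"^<U([0-9A-Fa-f]+)>\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|(\d)\s*$"
)


def parse_ucm_line(line):
    """Return (code point, encoded bytes, flag), or None for non-mapping lines."""
    m = UCM_LINE.match(line.strip())
    if m is None:
        return None
    ucs = int(m.group(1), 16)
    # "\xFE\x59" -> ["", "FE", "59"] -> b"\xfe\x59"
    code = bytes(int(h, 16) for h in m.group(2).split("\\x") if h)
    return ucs, code, int(m.group(3))


def round_trip_mappings(lines):
    """Keep only flag-0 (round-trip) entries, keyed by Unicode code point."""
    table = {}
    for line in lines:
        parsed = parse_ucm_line(line)
        if parsed is not None and parsed[2] == 0:
            table[parsed[0]] = parsed[1]
    return table


# Sample lines taken from the diff; flags 1, 3 and 4 and the
# commented-out line are all skipped.
sample = [
    r"<U20AC> \x80 |3",
    r"<U9FB4> \xFE\x59 |0",
    r"<U9FB4> \x82\x35\x90\x37 |3",
    r"<UE81E> \xFE\x59 |1",
    r"# <UE5E5> \xA3\xA0 |0",
    r"<UE5E5> \xA3\xA0 |4",
]
table = round_trip_mappings(sample)
```

With this sample, only the U+9FB4 to \xFE\x59 entry survives, which matches the intent that 0xFE59 now maps to U+9FB4 instead of U+E81E.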
For your question:
> "9 characters are no longer required by the new standard, but are
> retained in this patch for compatibility"
>
> How is that done?
The 9 mappings are unchanged between 2000.ucm and 2022.ucm. For example, GB18030 code 0xFD9C is one of the 9 codes that are no longer required, but the mapping:
<UF92C> \xFD\x9C |0
still appears in 2022.ucm, so this character is retained.
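As a rough illustration of how "retained" can be checked, the Python sketch below compares mapping lines from two versions as sets. The two excerpts are tiny hand-picked samples built from lines quoted in this mail, not the full .ucm files, and the helper name `mapping_lines` is hypothetical.

```python
def mapping_lines(lines):
    """Collect non-comment mapping lines, normalising whitespace."""
    return {
        " ".join(line.split())
        for line in lines
        if line.lstrip().startswith("<U")
    }


# Hand-picked excerpts for illustration only.
old_excerpt = [
    r"<UE81E> \xFE\x59 |0",
    r"<UF92C> \xFD\x9C |0",
]
new_excerpt = [
    r"<U9FB4> \xFE\x59 |0",
    r"<UE81E> \xFE\x59 |1",
    r"<UF92C> \xFD\x9C |0",
]

old = mapping_lines(old_excerpt)
new = mapping_lines(new_excerpt)

retained = old & new  # identical in both versions
changed = old ^ new   # present in only one version
```

Here the U+F92C line lands in `retained`, while the 0xFE59 lines land in `changed`, mirroring the difference between a not-required-but-kept character and a remapped one.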
Chao Li (Evan)
--------------------
HighGo Software Co., Ltd.
https://www.highgo.com/
> On Aug 11, 2025, at 13:50, John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
>
> On Mon, Aug 11, 2025 at 9:01 AM Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> wrote:
>>
>> I have created a patch https://commitfest.postgresql.org/patch/5954/. CommitFests requested a rebase, so I rebased the code and created the v2 patch.
>>
>> BTW, I have tested all 66 new characters, 9 not-required characters and 18 changed characters in a way as:
>
> "9 characters are no longer required by the new standard, but are
> retained in this patch for compatibility"
>
> How is that done?
>
>> I added a test case with a mapping changed char, and the test passes:
>>
>> % make check
>> ...
>> # All 229 tests passed.
>>
>> For more details on the standard change, see https://ken-lunde.medium.com/the-gb-18030-2022-standard-3d0ebaeb4132
>>
>> I am attaching the patch file.
>
> Going from the old .xml file to the .ucm file makes it difficult to
> see the relevant changes. Also, there are nearly 1000 non-user-visible
> changes like this in the output file that are not explained:
>
> - /*** Three byte table, leaf: efa8xx - offset 0x07aba ***/
> + /*** Three byte table, leaf: efa8xx - offset 0x07b3a ***/
>
> The 2000 version is available in the .ucm format, so maybe converting
> to that first would be a good preparatory patch:
>
> https://github.com/unicode-org/icu-data/blob/main/charset/data/ucm/gb-18030-2000.ucm
>
> Looking at the history, it looks like that file has seen small
> revisions, so it may take some research to get the exact equivalent to
> the XML file we use. That will also tell us if anything will change
> for us besides the actual 2022 revision.
>
> --
> John Naylor
> Amazon Web Services