From: | Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> |
---|---|
To: | John Naylor <johncnaylorls(at)gmail(dot)com> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, JiaoShuntian <jiaoshuntian(at)highgo(dot)com(dot)w(dot)kunlunaq(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: GB18030-2022 Support in PostgreSQL |
Date: | 2025-08-07 08:14:44 |
Message-ID: | CAEoWx2mvqeC0Qmcf5UqYhG1OWe5Mjie15nD-0owNr+4zQF6eTA@mail.gmail.com |
Lists: | pgsql-hackers |
I did more research on what changed in 2022 relative to 2000. Here is a
summary:
* 66 new characters have been added in 2022. All of them are 4-byte
characters. The map files store only 2-byte GB code mappings, and 4-byte
GB code mappings are calculated, so these characters can already be properly
encoded/decoded without this patch. I tested that.
* 9 characters are no longer required by 2022, but applications may decide
whether to retain them. As the ucm file (
https://github.com/unicode-org/icu/blob/main/icu4c/source/data/mappings/gb18030-2022.ucm)
retains them, we also retain them.
* Unicode mappings for 18 characters have changed. Only these changes will
cause backward compatibility issues. However, half of them are rarely
used punctuation marks, and the rest are glyphs that I cannot recognize
as a native Chinese speaker. So these changes should not significantly
impact most existing databases.
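To illustrate what "calculated" means for the 4-byte codes, here is a minimal Python sketch. It is not the code in the patch or in PostgreSQL's conversion routines, and the function names are made up for illustration. It assumes only the layout of GB18030 4-byte codes (bytes 1 and 3 in 0x81-0xFE, bytes 2 and 4 in 0x30-0x39) and the linear mapping of 0x90308130 to U+10000 for the supplementary planes. Other linearly mapped ranges would use the same arithmetic with a different base.

```python
# Sketch only: linear arithmetic for GB18030 four-byte codes.
# All function names here are hypothetical, chosen for illustration.

def gb4_linear(code: bytes) -> int:
    """Ordinal position of a 4-byte GB18030 code, counting from 0x81308130."""
    b0, b1, b2, b3 = code
    # Bytes 1 and 3 have 126 possible values, bytes 2 and 4 have 10.
    return (((b0 - 0x81) * 10 + (b1 - 0x30)) * 126 + (b2 - 0x81)) * 10 + (b3 - 0x30)


def gb4_unlinear(index: int) -> bytes:
    """Inverse of gb4_linear: turn an ordinal position back into 4 bytes."""
    index, b3 = divmod(index, 10)
    index, b2 = divmod(index, 126)
    b0, b1 = divmod(index, 10)
    return bytes([b0 + 0x81, b1 + 0x30, b2 + 0x81, b3 + 0x30])


# 0x90308130 is the 4-byte code for U+10000; the supplementary planes
# follow it contiguously.
SUPP_BASE = gb4_linear(b"\x90\x30\x81\x30")


def supplementary_to_gb18030(cp: int) -> bytes:
    """Encode a supplementary-plane code point without any lookup table."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("this sketch handles only U+10000..U+10FFFF")
    return gb4_unlinear(SUPP_BASE + (cp - 0x10000))


def gb18030_to_supplementary(code: bytes) -> int:
    """Decode a 4-byte code from the supplementary range to a code point."""
    cp = 0x10000 + gb4_linear(code) - SUPP_BASE
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code is outside the supplementary range")
    return cp
```

For example, `supplementary_to_gb18030(0x10000)` gives the bytes `90 30 81 30`, and `gb18030_to_supplementary` reverses that. Because no table entry is involved, a character added to such a range by a newer standard is handled by the same arithmetic.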
I added a test case using a character whose mapping changed, and the test passes:
% make check
...
# All 229 tests passed.
For more details on the standard change, see
https://ken-lunde.medium.com/the-gb-18030-2022-standard-3d0ebaeb4132
I am attaching the patch file.
Chao Li (Evan)
---------------------
Highgo Software Co., Ltd.
https://www.highgo.com/
On Tue, Aug 5, 2025 at 18:25, John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
> On Tue, Aug 5, 2025 at 1:22 PM Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> wrote:
> >
> > On Aug 4, 2025, at 21:51, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >
> > So on the whole I'd lean a bit towards just redefining GB18030 as
> > meaning the new standard. The fact that we don't support it as a
> > server-side encoding perhaps makes that idea more tenable than it
> > would be if the encoding governed the interpretation of our own
> > stored data.
>
> > I agree with Tom that we may just redefine GB18030 to comply with the
> > 2022 standard.
> >
> > As John Naylor pointed out, 2022 is not backward compatible; that is true.
> > However, I went through all the incompatible changes, and they all affect
> > rarely used characters.
>
> If that's the case then redefining is probably okay.
>
> > One use case I am thinking of: say a database uses the default encoding
> > (UTF-8) and the ICU locale provider. ICU has supported GB18030-2022
> > since version 73.1.
>
> ICU locales can only be used with server-side encodings.
>
> > At the time the new version is released, if some third-party
> > migration tools are known to work fine, the release notes may recommend
> > those tools.
>
> I highly doubt such a large hammer will be necessary. Whatever advice
> we give for discovery and conversion of affected text is our
> responsibility and can be in the form of example queries.
>
> --
> John Naylor
> Amazon Web Services
>
Attachment | Content-Type | Size |
---|---|---|
v1-0001-Upgrade-GB18030-encoding-support-from-2000-to-202.patch | application/octet-stream | 2.0 MB |