From: | John Naylor <johncnaylorls(at)gmail(dot)com> |
---|---|
To: | Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> |
Cc: | Peter Eisentraut <peter(at)eisentraut(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net> |
Subject: | Re: GB18030-2022 Support in PostgreSQL |
Date: | 2025-09-11 07:39:58 |
Message-ID: | CANWCAZbM9Nex8A6BjpdmyHz44-1cizxvJ+zYsG7ikuxp2zJgYw@mail.gmail.com |
Lists: | pgsql-hackers |
On Wed, Sep 10, 2025 at 6:54 PM Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> wrote:
> I downloaded the tests from the referenced mail, but I cannot get the
> tests to run. After extracting the 2 patch files, src/test/encodings was
> added, but "make check" does not seem to run them. I tried copying the
> .out and .sql files to src/test/regress, but the tests still do not run.
> Did I miss anything?
Sorry, I'm not quite sure how to get it to run like a normal test either. I
got it to show the result by running:
psql -f src/test/encodings/sql/init.sql
psql -f src/test/encodings/sql/gb18030.sql > patch.out
diff -u src/test/encodings/expected/gb18030.out patch.out > v5-test.diff
I've attached what I got with the v5 patches, renamed to avoid being picked
up by CI.
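For anyone who wants to script the manual steps above, here is a rough Python sketch of the same procedure. It assumes psql is on PATH and connects with default settings, and that the test output is UTF-8. The helper names (run_psql_file, unified_diff) are made up for illustration and are not part of the patches.

```python
import difflib
import subprocess
from pathlib import Path

# Directory added by the test patches (path taken from the commands above).
TEST_DIR = Path("src/test/encodings")


def run_psql_file(sql_file):
    """Run one SQL script through psql and return its stdout as text.

    Assumption: psql is on PATH and default connection settings work.
    """
    result = subprocess.run(
        ["psql", "-f", str(sql_file)],
        capture_output=True, encoding="utf-8", errors="replace", check=True,
    )
    return result.stdout


def unified_diff(expected, actual, expected_name, actual_name):
    """Return a unified diff of two texts, similar to `diff -u`."""
    return "".join(difflib.unified_diff(
        expected.splitlines(keepends=True),
        actual.splitlines(keepends=True),
        fromfile=expected_name, tofile=actual_name,
    ))


def main():
    # Same order as the manual steps: init.sql first, then the test itself.
    run_psql_file(TEST_DIR / "sql" / "init.sql")
    actual = run_psql_file(TEST_DIR / "sql" / "gb18030.sql")
    expected_path = TEST_DIR / "expected" / "gb18030.out"
    expected = expected_path.read_text(encoding="utf-8", errors="replace")
    diff_text = unified_diff(expected, actual, str(expected_path), "patch.out")
    Path("v5-test.diff").write_text(diff_text, encoding="utf-8")


if __name__ == "__main__":
    main()
```

An empty v5-test.diff would mean the output matched the expected file.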
>> The upstream correction to the 2000 version is not present in our
>> mappings, so we should mention that, unless it was reverted in or
>> before 2022.
>
>
> I think the upstream correction to the 2000 version is just a few
> non-round-trip chars that are ignored by us. So I feel we don't need to
> mention them.
This is the commit, and both of these are in the 2022 file as round-trip
mappings. I don't see any mappings with a non-zero flag in the 2000 file (in
any upstream commit).
https://github.com/unicode-org/icu-data/commit/91850aec0209fee0f91248731c9dd4b2b768d2b5
We should mention this correction for completeness. It seems to just move
'ḿ' out of the private use area. To be sure, likely almost no one will
notice.
>> Your draft commit message had "9 characters are no longer required by
>> the new standard, but are retained in this patch for compatibility"
>> ...but those nine were introduced in the 2005 version, right? In which
>> case it doesn't affect us. Please confirm.
>
>
> I can't find any hint about whether the 9 characters were introduced in
> the 2005 version.
Okay, I must have been confused by the language "was included" in one of the
linked references, which doesn't necessarily mean they were introduced
there.
The 66 new mappings required are not in the 2022 UCM file and we already
cover them algorithmically in utf8_and_gb18030.c, so they already work
without this patch (see below, the glyphs render on my OS but maybe not
everyone can see them). The commit message needs to focus on what actually
changed for users (I'll work on that). Related information should be an
afterthought.
# SELECT convert_from(decode('82358F33', 'hex'), 'GB18030');
convert_from
--------------
龦
(1 row)
# SELECT convert_from(decode('82359636', 'hex'), 'GB18030');
convert_from
--------------
鿯
(1 row)
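To illustrate what "algorithmically" means here, the sketch below computes the linear position of a four-byte GB18030 sequence and derives a code point as an offset from a range start. It is a simplified illustration, not the code in utf8_and_gb18030.c. It assumes the first query's result is U+9FA6 and that both sequences fall in the same linear range. The function names are hypothetical.

```python
def gb18030_linear(four_bytes):
    """Map a four-byte GB18030 sequence to its position in linear order.

    Byte ranges are 0x81-0xFE, 0x30-0x39, 0x81-0xFE, 0x30-0x39, so the
    sequence behaves like a mixed-radix number with radices 126, 10, 126, 10.
    """
    b1, b2, b3, b4 = four_bytes
    if not (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
        raise ValueError("not a four-byte GB18030 sequence")
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


# Assumption for illustration: the range starts at the first sequence queried
# above, GB+82358F33, and that sequence corresponds to U+9FA6.
RANGE_START_GB = bytes.fromhex("82358F33")
RANGE_START_UCS = 0x9FA6


def decode_in_range(four_bytes):
    """Return the code point for a sequence in the assumed linear range."""
    offset = gb18030_linear(four_bytes) - gb18030_linear(RANGE_START_GB)
    if offset < 0:
        raise ValueError("sequence precedes the assumed range start")
    return RANGE_START_UCS + offset


if __name__ == "__main__":
    for hex_code in ("82358F33", "82359636"):
        cp = decode_in_range(bytes.fromhex(hex_code))
        print(f"GB+{hex_code} -> U+{cp:04X} {chr(cp)}")
```

The second sequence is 73 positions after the first, which gives U+9FEF, consistent with the second query's result.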
While looking at utf8_and_gb18030.c, I see it refers to the XML file as the
source of the algorithmic ranges. We'll want to keep some reference to the
ranges independent of the XML file. I found
...which gives general info and mentions that U+10000 starts at
GB+90308130, and also links to
https://github.com/unicode-org/icu-data/blob/main/charset/source/gb18030/ranges.txt
...which has the same ranges we have below U+10000. Links can always
disappear, but if the algorithmic ranges ever need to change (unlikely),
we'll have new information about that.
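As a sanity check on the U+10000 / GB+90308130 anchor, here is a small sketch of the supplementary-plane mapping in both directions. It assumes the mapping is purely linear from that anchor up to U+10FFFF. The helper names are hypothetical and not taken from our source tree.

```python
SUPPLEMENTARY_BASE_GB = bytes.fromhex("90308130")  # anchor mentioned above
SUPPLEMENTARY_BASE_UCS = 0x10000


def to_linear(four_bytes):
    """Linear position of a four-byte GB18030 sequence (radices 126/10/126/10)."""
    b1, b2, b3, b4 = four_bytes
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


def from_linear(n):
    """Inverse of to_linear: rebuild the four bytes from a linear position."""
    n, d4 = divmod(n, 10)
    n, d3 = divmod(n, 126)
    d1, d2 = divmod(n, 10)
    return bytes([d1 + 0x81, d2 + 0x30, d3 + 0x81, d4 + 0x30])


def supplementary_to_gb18030(code_point):
    """Encode a code point in U+10000..U+10FFFF as four GB18030 bytes."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - SUPPLEMENTARY_BASE_UCS
    return from_linear(to_linear(SUPPLEMENTARY_BASE_GB) + offset)


def gb18030_to_supplementary(four_bytes):
    """Decode four GB18030 bytes at or above the anchor to a code point."""
    offset = to_linear(four_bytes) - to_linear(SUPPLEMENTARY_BASE_GB)
    if not 0 <= offset <= 0xFFFFF:
        raise ValueError("outside the supplementary-plane range")
    return SUPPLEMENTARY_BASE_UCS + offset


if __name__ == "__main__":
    for cp in (0x10000, 0x1F600, 0x10FFFF):
        print(f"U+{cp:X} -> GB+{supplementary_to_gb18030(cp).hex().upper()}")
```

Under that assumption, U+10FFFF comes out as GB+E3329A35, so the whole supplementary range fits in the four-byte space above the anchor.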
--
John Naylor
Amazon Web Services
Attachment | Content-Type | Size |
---|---|---|
v5-test.diff.nocfbot | application/octet-stream | 7.1 KB |