Re: GB18030-2022 Support in PostgreSQL

From: John Naylor <johncnaylorls(at)gmail(dot)com>
To: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>
Cc: Peter Eisentraut <peter(at)eisentraut(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject: Re: GB18030-2022 Support in PostgreSQL
Date: 2025-09-11 07:39:58
Message-ID: CANWCAZbM9Nex8A6BjpdmyHz44-1cizxvJ+zYsG7ikuxp2zJgYw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Sep 10, 2025 at 6:54 PM Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> wrote:

> I downloaded the tests from the referenced mail, but I cannot make the
> tests to run. After extracting the 2 patch files, it added
> src/test/encodings, but "make check" seems to not run them. I tried to copy
> .out and .sql files to src/test/regress, but the tests still not running.
> Did I miss anything?

Sorry, I'm not quite sure either how to get it to run like a normal test. I
got it to show the result by running:

psql -f src/test/encodings/sql/init.sql
psql -f src/test/encodings/sql/gb18030.sql > patch.out
diff -u src/test/encodings/expected/gb18030.out patch.out > v5-test.diff

I've attached what I got with the v5 patches, renamed to avoid being picked
up by CI.

>> The upstream correction to the 2000 version is not present in our
>> mappings, so we should mention that, unless it was reverted in or
>> before 2022.
>
>
> I think the upstream correction to the 2000 version is just a few not
> round-trip chars that are ignored by us. So I feel we don't need to mention
> them.

This is the commit (linked below), and both of these are in the 2022 file as
round-trip mappings. I don't see any mappings with a non-zero flag in the
2000 file (in any upstream commit).

https://github.com/unicode-org/icu-data/commit/91850aec0209fee0f91248731c9dd4b2b768d2b5

We should mention this correction for completeness. It seems to just move
'ḿ' out of the private use area. To be sure, likely almost no one will
notice.

>> Your draft commit message had "9 characters are no longer required by
>> the new standard, but are retained in this patch for compatibility"
>> ...but those nine were introduced in the 2005 version, right? In which
>> case it doesn't affect us. Please confirm.
>
>
> I don't find any hint about if the 9 characters were introduced in the
> 2005 version.

Okay, I must have been confused by the language "was included" in one of the
linked references, which doesn't necessarily mean they were introduced
there.

The 66 new mappings required are not in the 2022 UCM file and we already
cover them algorithmically in utf8_and_gb18030.c, so they already work
without this patch (see below, the glyphs render on my OS but maybe not
everyone can see them). The commit message needs to focus on what actually
changed for users (I'll work on that). Related information should be an
afterthought.

# SELECT convert_from(decode('82358F33', 'hex'), 'GB18030');
convert_from
--------------

(1 row)

# SELECT convert_from(decode('82359636', 'hex'), 'GB18030');
convert_from
--------------

(1 row)
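
For anyone whose terminal cannot render the glyphs, the code points behind
the two queries above can be worked out by hand. What follows is only a
rough Python sketch, not code from utf8_and_gb18030.c. It assumes the usual
linear numbering of four-byte GB18030 sequences (counted from 81 30 81 30)
and assumes that the algorithmic range in question starts at
GB+82358F33 = U+9FA6. The names gb_linear, RANGE_FIRST_GB, RANGE_FIRST_CP
and range_to_codepoint are made up for illustration.

```python
def gb_linear(gb: bytes) -> int:
    """Linear index of a four-byte GB18030 sequence, counted from 81 30 81 30."""
    if len(gb) != 4:
        raise ValueError("expected exactly four bytes")
    b1, b2, b3, b4 = gb
    if not (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
        raise ValueError("not a valid four-byte GB18030 sequence")
    # Bytes 1 and 3 each take 126 values, bytes 2 and 4 each take 10.
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


# Assumption: the algorithmic range starts here and maps to U+9FA6.
RANGE_FIRST_GB = bytes.fromhex("82358F33")
RANGE_FIRST_CP = 0x9FA6


def range_to_codepoint(gb: bytes) -> int:
    """Map a four-byte sequence inside the assumed range to its code point."""
    return RANGE_FIRST_CP + gb_linear(gb) - gb_linear(RANGE_FIRST_GB)


if __name__ == "__main__":
    for hex_bytes in ("82358F33", "82359636"):
        cp = range_to_codepoint(bytes.fromhex(hex_bytes))
        print(f"GB+{hex_bytes} -> U+{cp:04X}")
```

Under those assumptions the two byte sequences come out as U+9FA6 and
U+9FEF, which are 73 positions apart in the linear numbering.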

While looking at utf8_and_gb18030.c, I see it refers to the XML file as the
source of the algorithmic ranges. We'll want to keep some reference to the
ranges independent of the XML file. I found

https://htmlpreview.github.io/?https://github.com/unicode-org/icu-data/blob/main/charset/source/gb18030/gb18030.html

...which gives general info and mentions that U+10000 starts at
GB+90308130, and also links to

https://github.com/unicode-org/icu-data/blob/main/charset/source/gb18030/ranges.txt

...which has the same ranges we have below U+10000. Links can always
disappear, but if the algorithmic ranges ever need to change (unlikely),
we'll have new information about that.
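
To make the "U+10000 starts at GB+90308130" statement concrete, here is a
small Python sketch of the reverse direction, from a supplementary code
point to its four-byte GB18030 form. It is an illustration under the
assumption that the whole range U+10000..U+10FFFF is mapped linearly from
that starting point. It is not the PostgreSQL implementation, and the
function name supplementary_to_gb18030 is hypothetical.

```python
def supplementary_to_gb18030(cp: int) -> bytes:
    """Encode a code point in U+10000..U+10FFFF as four GB18030 bytes.

    Assumes a purely linear mapping that starts at U+10000 = 90 30 81 30.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is outside the supplementary planes")

    # Linear index of 90 30 81 30, counted from 81 30 81 30:
    # each step of the first byte spans 10 * 126 * 10 = 12600 sequences.
    base = (0x90 - 0x81) * 12600
    linear = base + (cp - 0x10000)

    # Peel off the digits from the least significant byte upwards.
    linear, d4 = divmod(linear, 10)    # fourth byte: 0x30..0x39
    linear, d3 = divmod(linear, 126)   # third byte:  0x81..0xFE
    d1, d2 = divmod(linear, 10)        # second byte: 0x30..0x39, rest is first byte
    return bytes((0x81 + d1, 0x30 + d2, 0x81 + d3, 0x30 + d4))


if __name__ == "__main__":
    for cp in (0x10000, 0x10001, 0x10FFFF):
        print(f"U+{cp:04X} -> GB+{supplementary_to_gb18030(cp).hex().upper()}")
```

With this arithmetic U+10000 gives 90308130 and U+10FFFF gives E3329A35, so
a changed range boundary would show up as a different starting constant
rather than a different formula.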

--
John Naylor
Amazon Web Services

Attachment Content-Type Size
v5-test.diff.nocfbot application/octet-stream 7.1 KB
