Re: GB18030-2022 Support in PostgreSQL

From: John Naylor <johncnaylorls(at)gmail(dot)com>
To: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject: Re: GB18030-2022 Support in PostgreSQL
Date: 2025-08-12 04:57:45
Message-ID: CANWCAZZ129LpH3Z+i1q+aE-X6fNNg0FYF1fRK0pd2AEpSM8hmw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Aug 12, 2025 at 9:09 AM Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> wrote:

[bringing this back to the original thread]

> So, I compared the 2000 ucm with the 2005 ucm, and also the 2005 ucm with the 2022 ucm. I found that some changes made in 2005 are reverted in 2022, which is why the diff between 2000 and 2022 is small. For example, the following mappings

Yes, this was mentioned in the "disruptive changes" document linked in
my first email in this thread:

"The 2005 edition included 6 characters with double mappings. The 2022
edition removes the double mappings.
The 2005 edition included 9 characters from the CJK Compatibility
Ideographs block. In Unicode/10646, these all have canonical
decomposition mappings to characters in the URO. In the 2022 edition,
these nine compatibility characters are removed."

> So, for how to create patch 2, I think we have 3 options:
>
> 1. As planned, update to the latest version of the 2000 ucm, then skip 2005 and upgrade directly to 2022 in patch 3. This way, we just honor the 2000 ucm even though the change was actually introduced in 2005.
>
> 2. Skip the latest version of the 2000 ucm and upgrade to the 2005 ucm. This clearly shows the upgrade path 2000->2005->2022. The downside is that 2005 introduced some changes that are reverted in 2022, which causes some unnecessary changes in the map files.
>
> 3. Skip patch 2 and go directly to patch 3, so that patch 3 includes the changes introduced by both 2005 and 2022. This makes the minimum changes to the map files.

#3 is what I had in mind to begin with unless we found some reason not
to. Minimizing churn is a lucky side effect that reinforces that
choice.
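
For anyone who wants to check the "changed in 2005, reverted in 2022"
observation themselves, here is a rough sketch in Python. It is not part
of any patch. It assumes the three .ucm files use the usual ICU layout,
with mapping lines such as `<U4E00> \xD2\xBB |0` between CHARMAP and
END CHARMAP, and it only looks at round-trip (|0) entries. The function
names parse_ucm and reverted_changes are made up for illustration.

```python
import re

# Assumed shape of a UCM mapping line: <UXXXX> \xNN[\xNN...] |flag
_UCM_LINE = re.compile(
    r"^<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|(\d)"
)


def parse_ucm(text):
    """Return {code point: encoded bytes} for round-trip (|0) mappings."""
    mapping = {}
    in_charmap = False
    for line in text.splitlines():
        line = line.strip()
        if line == "CHARMAP":
            in_charmap = True
            continue
        if line == "END CHARMAP":
            break
        if not in_charmap:
            continue  # skip header fields and comments before the table
        m = _UCM_LINE.match(line)
        if m and m.group(3) == "0":
            code_point = int(m.group(1), 16)
            encoded = bytes.fromhex(m.group(2).replace("\\x", ""))
            mapping[code_point] = encoded
    return mapping


def reverted_changes(v2000, v2005, v2022):
    """Code points whose 2005 mapping differs, but 2000 and 2022 agree."""
    points = set(v2000) | set(v2005) | set(v2022)
    return sorted(
        cp
        for cp in points
        if v2000.get(cp) == v2022.get(cp) and v2005.get(cp) != v2000.get(cp)
    )
```

Reading each file with open(path, encoding="utf-8").read(), passing the
text to parse_ucm, and then calling reverted_changes on the three results
should list the code points that would churn under option 2 but not
under option 3.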

Before getting to that, I thought I'd bring this up to the community:

+# Copyright (C) 2000-2009, International Business Machines Corporation and others.
+# All Rights Reserved.

The previous XML file didn't contain a copyright notice -- does anyone
want to make a case for not checking unicode-org's source file into
our tree because of this? The 2022 update changes it to

# Copyright (C) 2016 and later: Unicode, Inc. and others.
# License & terms of use: http://www.unicode.org/copyright.html
# Copyright (C) 2000-2012, International Business Machines Corporation and others.
# All Rights Reserved.

...and the above links to https://www.unicode.org/license.txt

--
John Naylor
Amazon Web Services
