pgsql: Extend GB18030 encoding conversion to cover full Unicode range.

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-committers(at)postgresql(dot)org
Subject: pgsql: Extend GB18030 encoding conversion to cover full Unicode range.
Date: 2015-05-15 19:02:28
Message-ID: E1YtKsO-0002LE-Pw@gemulon.postgresql.org
Lists: pgsql-committers

Extend GB18030 encoding conversion to cover full Unicode range.

Our previous code for GB18030 <-> UTF8 conversion only covered Unicode code
points up to U+FFFF, but the actual spec defines conversions for all code
points up to U+10FFFF. That would be rather impractical as a lookup table,
but fortunately there is a simple algorithmic conversion between the
additional code points and the equivalent GB18030 byte patterns. Make use
of the just-added callback facility in LocalToUtf/UtfToLocal to perform the
additional conversions.
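
To illustrate the kind of algorithmic conversion meant here, the following is a
minimal Python sketch, not the C code in this commit. It assumes the usual
GB18030 four-byte layout (first and third bytes in 0x81..0xFE, second and
fourth bytes in 0x30..0x39) and that U+10000 corresponds to the byte sequence
90 30 81 30, with later code points following in linear order. All function
names are hypothetical.

```python
# Sketch only: helper names are hypothetical and are not PostgreSQL's.

def gb_linear(b1, b2, b3, b4):
    """Position of a four-byte GB18030 sequence, counting from 81 30 81 30."""
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


def gb_unlinear(n):
    """Inverse of gb_linear: turn a linear position back into four bytes."""
    n, b4 = divmod(n, 10)
    n, b3 = divmod(n, 126)
    b1, b2 = divmod(n, 10)
    return bytes([b1 + 0x81, b2 + 0x30, b3 + 0x81, b4 + 0x30])


# Assumption: U+10000 maps to 90 30 81 30.
SUPP_BASE = gb_linear(0x90, 0x30, 0x81, 0x30)


def supplementary_to_gb18030(cp):
    """Map a code point in U+10000..U+10FFFF to its four GB18030 bytes."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    return gb_unlinear(SUPP_BASE + (cp - 0x10000))


def gb18030_to_supplementary(seq):
    """Map four GB18030 bytes back to a code point in U+10000..U+10FFFF."""
    b1, b2, b3, b4 = seq
    if not (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
        raise ValueError("not a well-formed four-byte GB18030 sequence")
    cp = 0x10000 + gb_linear(b1, b2, b3, b4) - SUPP_BASE
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("outside the supplementary-plane range")
    return cp
```

Because the mapping is pure arithmetic, no table entries are needed for the
roughly one million code points above U+FFFF.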

Having created the infrastructure to do that, we can use the same code to
map certain linearly-related subranges of the Unicode space below U+FFFF,
allowing removal of the corresponding lookup table entries. This more
than halves the lookup table size, which is a substantial savings;
utf8_and_gb18030.so drops from nearly a megabyte to about half that.
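
As a rough illustration of how a linearly-related subrange can replace table
entries, here is a small Python sketch. The single range shown (U+0452..U+200F
starting at bytes 81 30 D3 30) is an assumed example of such a subrange, and
the names RANGES and encode_in_range are hypothetical. The real conversion code
is in C and is structured differently.

```python
# Sketch only: the range below is an assumed example; names are hypothetical.

def to_linear(seq):
    """Linear position of a four-byte GB18030 sequence, from 81 30 81 30."""
    b1, b2, b3, b4 = seq
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


def from_linear(n):
    """Inverse of to_linear."""
    n, b4 = divmod(n, 10)
    n, b3 = divmod(n, 126)
    b1, b2 = divmod(n, 10)
    return bytes([b1 + 0x81, b2 + 0x30, b3 + 0x81, b4 + 0x30])


# (first code point, last code point, GB18030 bytes of the first code point)
RANGES = [
    (0x0452, 0x200F, bytes([0x81, 0x30, 0xD3, 0x30])),
]


def encode_in_range(cp):
    """Return four GB18030 bytes if cp lies in a linear range, else None.

    None means the caller would fall back to the ordinary lookup table.
    """
    for first, last, first_bytes in RANGES:
        if first <= cp <= last:
            return from_linear(to_linear(first_bytes) + (cp - first))
    return None
```

Each range costs one tuple here, where a plain lookup table would need one
entry per code point in the range.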

In support of doing that, replace ISO10646-GB18030.TXT with the data file
gb-18030-2000.xml (retrieved from
http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/ )
in which these subranges have been deleted from the simple lookup entries.

Per bug #12845 from Arjen Nienhuis. The conversion code added here is
based on his proposed patch, though I whacked it around rather heavily.

Branch
------
master

Details
-------
http://git.postgresql.org/pg/commitdiff/8d3e0906df5496b853cc763f87b9ffd2ae27adbe

Modified Files
--------------
src/backend/utils/mb/Unicode/ISO10646-GB18030.TXT | 63488 --------------------
src/backend/utils/mb/Unicode/Makefile             |     8 +-
src/backend/utils/mb/Unicode/UCS_to_GB18030.pl    |    81 +-
src/backend/utils/mb/Unicode/gb-18030-2000.xml    | 30916 ++++++++++
src/backend/utils/mb/Unicode/gb18030_to_utf8.map  | 32633 +---------
src/backend/utils/mb/Unicode/utf8_to_gb18030.map  | 32631 +---------
.../utf8_and_gb18030/utf8_and_gb18030.c           |   159 +-
7 files changed, 31111 insertions(+), 128805 deletions(-)