Re: Patch for bug #12845 (GB18030 encoding)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Arjen Nienhuis <a(dot)g(dot)nienhuis(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Patch for bug #12845 (GB18030 encoding)
Date: 2015-05-15 14:10:18
Message-ID: 19727.1431699018@sss.pgh.pa.us
Lists: pgsql-hackers

Arjen Nienhuis <a(dot)g(dot)nienhuis(at)gmail(dot)com> writes:
> GB18030 is a special case, because it's a full mapping of all Unicode
> characters, and most of it is algorithmically defined.

True.

> This makes UtfToLocal a bad choice to implement it.

I disagree with that conclusion. There are still 30000+ characters
that need to be translated via lookup table, so we still need either
UtfToLocal or a clone of it; and as I said previously, I'm not on board
with cloning it.

> I think the best solution is to get rid of UtfToLocal for GB18030. Use
> a specialized algorithm:
> - For characters > U+FFFF use the algorithm from my patch
> - For characters <= U+FFFF use special mapping tables to map from/to
> UTF32. Those tables would be smaller, and the code would be faster (I
> assume).
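
For reference, the algorithmic part for characters above U+FFFF can be sketched as below. This is only an illustration of the linear four-byte mapping (U+10000 corresponds to 90 30 81 30), written in Python rather than C; it is not the code from the patch, and the function names are made up for the example.

```python
def supplementary_to_gb18030(cp: int) -> bytes:
    """Map a code point in U+10000..U+10FFFF to its 4-byte GB18030 form."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is outside the supplementary planes")
    # Linear offset from U+10000, which maps to 90 30 81 30.
    index = cp - 0x10000
    # Bytes 2 and 4 run over 0x30..0x39 (10 values),
    # byte 3 runs over 0x81..0xFE (126 values).
    index, b4 = divmod(index, 10)
    index, b3 = divmod(index, 126)
    b1, b2 = divmod(index, 10)
    return bytes((0x90 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4))


def gb18030_to_supplementary(data: bytes) -> int:
    """Inverse of the above; assumes data is a valid 4-byte sequence >= 90 30 81 30."""
    b1, b2, b3, b4 = data
    index = (((b1 - 0x90) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)
    return 0x10000 + index


print(supplementary_to_gb18030(0x10000).hex())   # 90308130
print(supplementary_to_gb18030(0x10FFFF).hex())  # e3329a35
```

No table is involved in this range, which is why it is uncontroversial; the disagreement is only about what to do at and below U+FFFF.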

I looked at what Wikipedia claims is the authoritative conversion table:

http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/gb-18030-2000.xml

According to that, about half of the characters below U+FFFF can be
processed via linear conversions, so I think we ought to save table
space by doing that. However, the remaining stuff that has to be
processed by lookup still contains a pretty substantial number of
characters that map to 4-byte GB18030 characters, so I don't think
we can get any table size savings by adopting a bespoke table format.
We might as well use UtfToLocal. (Worth noting in this connection
is that we haven't seen fit to sweat about UtfToLocal's use of 4-byte
table entries for other encodings, even though most of the others
are not concerned with characters outside the BMP.)
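
To illustrate the combination described above, here is a rough sketch in Python (not the actual C code, and not UtfToLocal's real interface): code points that fall in a linear range are converted arithmetically, and everything else goes through a lookup table. The single range shown, U+0452..U+200F mapping to 81 30 D3 30 .. 81 36 A5 31, is one of the ranges listed in the XML file; the function names and the one-entry lookup table are hypothetical stand-ins.

```python
def four_byte_to_index(b) -> int:
    """Linear index of a 4-byte GB18030 code, counting from 81 30 81 30."""
    b1, b2, b3, b4 = b
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


def index_to_four_byte(index: int) -> bytes:
    """Inverse of four_byte_to_index."""
    index, b4 = divmod(index, 10)
    index, b3 = divmod(index, 126)
    b1, b2 = divmod(index, 10)
    return bytes((0x81 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4))


# (first code point, last code point, GB18030 bytes of the first code point).
# Only one range is listed here; the real table has more.
LINEAR_RANGES = [
    (0x0452, 0x200F, (0x81, 0x30, 0xD3, 0x30)),
]

# Hypothetical stand-in for the lookup table; the real one has 30000+ entries.
LOOKUP = {
    0x4E00: b"\xd2\xbb",  # a 2-byte mapping
}


def bmp_to_gb18030(cp: int) -> bytes:
    """Convert a BMP code point: linear ranges first, then table lookup."""
    for first, last, start in LINEAR_RANGES:
        if first <= cp <= last:
            return index_to_four_byte(four_byte_to_index(start) + (cp - first))
    return LOOKUP[cp]  # raises KeyError for anything not in this toy table


print(bmp_to_gb18030(0x0452).hex())  # 8130d330
print(bmp_to_gb18030(0x200F).hex())  # 8136a531
```

The point of the sketch is that the range check only removes entries from the table; whatever remains still needs entries wide enough to hold 4-byte results.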

regards, tom lane
