Re: Patch for bug #12845 (GB18030 encoding)

From: Arjen Nienhuis <a(dot)g(dot)nienhuis(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Patch for bug #12845 (GB18030 encoding)
Date: 2015-05-15 11:13:33
Message-ID: CAG6W84+FnfkFj2G_OXeLEnUWAoK2QGHNmy-72E_Yza7J-LfvAQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, May 14, 2015 at 11:04 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I wrote:
>> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>>> On Wed, May 6, 2015 at 11:13 AM, Alvaro Herrera
>>> <alvherre(at)2ndquadrant(dot)com> wrote:
>>>> Maybe not, but at the very least we should consider getting it fixed in
>>>> 9.5 rather than waiting a full development cycle. Same as in
>>>> https://www.postgresql.org/message-id/20150428131549.GA25925@momjian.us
>>>> I'm not saying we MUST include it in 9.5, but we should at least
>>>> consider it. If we simply stash it in the open CF we guarantee that it
>>>> will linger there for a year.
>
>>> Sure, if somebody has the time to put into it now, I'm fine with that.
>>> I'm afraid it won't be me, though: even if I had the time, I don't
>>> know enough about encodings.
>
>> I concur that we should at least consider this patch for 9.5. I've
>> added it to
>> https://wiki.postgresql.org/wiki/PostgreSQL_9.5_Open_Items
>
> I looked at this patch a bit, and read up on GB18030 (thank you
> wikipedia). I concur we have a problem to fix. I do not like the way
> this patch went about it though, ie copying-and-pasting LocalToUtf and
> UtfToLocal and their supporting routines into utf8_and_gb18030.c.
> Aside from being duplicative, this means the improved mapping capability
> isn't available to use with anything except GB18030. (I do not know
> whether there are any linear mapping ranges in other encodings, but
> seeing that the Unicode crowd went to the trouble of defining a notation
> for it in http://www.unicode.org/reports/tr22/, I'm betting there are.)
>
> What I think would be a better solution, if slightly more invasive,
> is to extend LocalToUtf and UtfToLocal to add a callback function
> argument for a function of signature "uint32 translate(uint32)".
> This function, if provided, would be called after failing to find a
> mapping in the mapping table(s), and it could implement any translation
> that would be better handled by code than as a boatload of mapping-table
> entries. If it returns zero then it doesn't know a translation either,
> so throw error as before.
>
> An alternative definition that could be proposed would be to call the
> function before consulting the mapping tables, not after, on the grounds
> that the function can probably exit cheaply if the input's not in a range
> that it cares about. However, consulting the mapping table first wins
> if you have ranges that mostly work but contain a few exceptions: put
> the exceptions in the mapping table and then the function need not worry
> about handling them.
>
> Another alternative approach would be to try to define linear mapping
> ranges in a tabular fashion, for more consistency with what's there now.
> But that probably wouldn't work terribly well because the bytewise
> character representations used in this logic have to be converted into
> code points before you can do any sort of linear mapping. We could
> hard-wire that conversion for UTF8, but the conversion in the other code
> space would be encoding-specific. So we might as well just treat the
> whole linear mapping behavior as a black box function for each encoding.
>
> I'm also discounting the possibility that someone would want an
> algorithmic mapping for cases involving "combined" codes (ie pairs of
> UTF8 characters). Of the encodings we support, only EUC_JIS_2004 and
> SHIFT_JIS_2004 need such cases at all, and those have only a handful of
> cases; so it doesn't seem popular enough to justify the extra complexity.
>
> I also notice that pg_gb18030_verifier isn't even close to strict enough;
> it basically relies on pg_gb18030_mblen which contains no checks
> whatsoever on the third and fourth bytes. So that needs to be fixed.
>
> The verification tightening would definitely not be something to
> back-patch, and I'm inclined to think that the additional mapping
> capability shouldn't be either, in view of the facts that (a) we've
> had few if any field complaints yet, and (b) changing the signatures
> of LocalToUtf/UtfToLocal might possibly break third-party code.
> So I'm seeing this as a HEAD-only patch, but I do want to try to
> squeeze it into 9.5 rather than wait another year.
>
> Barring objections, I'll go make this happen.

GB18030 is a special case, because it's a full mapping of all unicode
characters, and most of it is algorithmically defined. This makes
UtfToLocal a bad choice to implement it. UtfToLocal assumes a sparse
array with only the defined characters. It uses binary search to find
a character. The 2 tables it uses now are huge (the .so file is 1MB).
Adding the rest of the valid characters to this scheme is possible,
but would make the problem worse.

I think fixing UtfToLocal only for the new characters is not optimal.

I think the best solution is to get rid of UtfToLocal for GB18030. Use
a specialized algorithm:
- For characters > U+FFFF use the algorithm from my patch
- For charcaters <= U+FFFF use special mapping tables to map from/to
UTF32. Those tables would be smaller, and the code would be faster (I
assume).

For example (256 KB):

unsigned int utf32_to_gb18030[65536] = {
/* 0x0 */ 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7,
/* 0x8 */ 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf,
--
/* 0xdb08 */ 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
--
/* 0xfff0 */ 0x8431a334, 0x8431a335, 0x8431a336, 0x8431a337,
0x8431a338, 0x8431a339, 0x8431a430, 0x8431a431,
/* 0xfff8 */ 0x8431a432, 0x8431a433, 0x8431a434, 0x8431a435,
0x8431a436, 0x8431a437, 0x8431a438, 0x8431a439
};

Instead of (500KB):

static pg_utf_to_local ULmapGB18030[ 63360 ] = {
{0xc280, 0x81308130},
{0xc281, 0x81308131},
--
{0xefbfbe, 0x8431a438},
{0xefbfbf, 0x8431a439}
};

See the attachment for a python script to generate those mappings.

Gr. Arjen

Attachment Content-Type Size
gen_gb18030_table.py application/octet-stream 2.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2015-05-15 11:27:13 pgsql: Allow GiST distance function to return merely a lower-bound.
Previous Message Stephen Frost 2015-05-15 11:10:20 Re: pgsql: Add pg_audit, an auditing extension