Re: Patch for bug #12845 (GB18030 encoding)

From: Arjen Nienhuis <a(dot)g(dot)nienhuis(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Patch for bug #12845 (GB18030 encoding)
Date: 2015-05-15 15:49:22
Message-ID: CAG6W84J+BJ0hEe1yrPL4bxVz-MaqCFdHkWRWVBiq8BaCoY8j3Q@mail.gmail.com
Lists: pgsql-hackers

On Fri, May 15, 2015 at 4:10 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Arjen Nienhuis <a(dot)g(dot)nienhuis(at)gmail(dot)com> writes:
>> GB18030 is a special case, because it's a full mapping of all unicode
>> characters, and most of it is algorithmically defined.
>
> True.
>
>> This makes UtfToLocal a bad choice to implement it.
>
> I disagree with that conclusion. There are still 30000+ characters
> that need to be translated via lookup table, so we still need either
> UtfToLocal or a clone of it; and as I said previously, I'm not on board
> with cloning it.
>
>> I think the best solution is to get rid of UtfToLocal for GB18030. Use
>> a specialized algorithm:
>> - For characters > U+FFFF use the algorithm from my patch
>> - For characters <= U+FFFF use special mapping tables to map from/to
>> UTF32. Those tables would be smaller, and the code would be faster (I
>> assume).
>
> I looked at what Wikipedia claims is the authoritative conversion table:
>
> http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/gb-18030-2000.xml
>
> According to that, about half of the characters below U+FFFF can be
> processed via linear conversions, so I think we ought to save table
> space by doing that. However, the remaining stuff that has to be
> processed by lookup still contains a pretty substantial number of
> characters that map to 4-byte GB18030 characters, so I don't think
> we can get any table size savings by adopting a bespoke table format.
> We might as well use UtfToLocal. (Worth noting in this connection
> is that we haven't seen fit to sweat about UtfToLocal's use of 4-byte
> table entries for other encodings, even though most of the others
> are not concerned with characters outside the BMP.)
>

It's not about 4-byte vs. 2-byte entries; it's about using 8 bytes per
character vs. 4. UtfToLocal uses a sparse array of pairs:

map = {{0, x}, {1, y}, {2, z}, ...}

vs.

map = {x, y, z, ...}

A sparse array is fine when only a fraction of the code points are
mapped, but GB18030 is different: almost every code point is used. A
plain array indexed by code point halves the table size and replaces
the binary search with a direct lookup.

Gr. Arjen
