Re: BUG #12845: The GB18030 encoding doesn't support Unicode characters over 0xFFFF

From: Arjen Nienhuis <a(dot)g(dot)nienhuis(at)gmail(dot)com>
To: hlinnaka(at)iki(dot)fi
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #12845: The GB18030 encoding doesn't support Unicode characters over 0xFFFF
Date: 2015-03-10 22:21:24
Message-ID: CAG6W84JZ-ZFhAM1GQzpVUOW8YM2gx6_-f4uCKU1j2sdmt+wO6g@mail.gmail.com
Lists: pgsql-bugs

On 10 Mar 2015 22:33, "Heikki Linnakangas" <hlinnaka(at)iki(dot)fi> wrote:
>
> On 03/09/2015 10:51 PM, a(dot)g(dot)nienhuis(at)gmail(dot)com wrote:
>>
>> The following bug has been logged on the website:
>>
>> Bug reference: 12845
>> Logged by: Arjen Nienhuis
>> Email address: a(dot)g(dot)nienhuis(at)gmail(dot)com
>> PostgreSQL version: 9.3.5
>> Operating system: Ubuntu Linux
>> Description:
>>
>> Step to reproduce:
>>
>> In psql:
>>
>> arjen=> select convert_to(chr(128512), 'GB18030');
>>
>> Actual output:
>>
>> ERROR: character with byte sequence 0xf0 0x9f 0x98 0x80 in encoding "UTF8"
>> has no equivalent in encoding "GB18030"
>>
>> Expected output:
>>
>> convert_to
>> ------------
>> \x9439fc36
>> (1 row)
>
>
> Hmm, looks like our gb18030 <-> Unicode conversion table only contains
> the Unicode BMP plane. Unicode code points above 0xffff are not included.
>
> If we added all the missing mappings as one-to-one mappings, like we've
> done for the BMP, that would bloat the table horribly. There are over 1
> million code points that are currently not mapped. Fortunately, the missing
> mappings are in linear ranges that would be fairly simple to handle
> programmatically. See e.g.
> https://ssl.icu-project.org/repos/icu/data/trunk/charset/source/gb18030/gb18030.html.
> Someone needs to write the code (I'm not volunteering myself).
>
> - Heikki

I can write a "uint32 UTF8toGB18030(uint32)" function, but I don't know
where to put it in the code.

(Maybe at line 479 of conv.c:
https://github.com/postgres/postgres/blob/4baaf863eca5412e07a8441b3b7e7482b7a8b21a/src/backend/utils/mb/conv.c#L479
)

Alternatively, I could extend the map file. It would double in size if it
only needs to include valid code points.
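
For illustration, here is a minimal sketch of the linear-range mapping for
code points above 0xFFFF. It assumes the input is a Unicode code point (not
the packed UTF-8 byte sequence that the existing map files use), and the
function name is hypothetical, not something in the PostgreSQL tree. The
four-byte GB18030 codes for U+10000..U+10FFFF start at 0x90308130 and count
up with the fourth and second bytes running 0x30..0x39 and the third byte
running 0x81..0xFE.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of PostgreSQL): map a Unicode code point in
 * the range U+10000..U+10FFFF to its four-byte GB18030 code, packed
 * big-endian into a uint32_t. Returns 0 for code points outside that range.
 */
uint32_t
supplementary_to_gb18030(uint32_t cp)
{
    uint32_t idx, b1, b2, b3, b4;

    if (cp < 0x10000 || cp > 0x10FFFF)
        return 0;

    /* Linear offset from the first supplementary code point. */
    idx = cp - 0x10000;

    /* Mixed-radix split: 10 values, 126 values, 10 values, then lead byte. */
    b4 = 0x30 + idx % 10;
    idx /= 10;
    b3 = 0x81 + idx % 126;
    idx /= 126;
    b2 = 0x30 + idx % 10;
    idx /= 10;
    b1 = 0x90 + idx;

    return (b1 << 24) | (b2 << 16) | (b3 << 8) | b4;
}
```

For chr(128512), which is U+1F600, this gives 0x9439FC36, matching the
expected output in the bug report. The BMP ranges that are also linear would
need a separate offset table and are not covered by this sketch.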
