
Re: BUG #12845: The GB18030 encoding doesn't support Unicode characters over 0xFFFF

From:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To:	a(dot)g(dot)nienhuis(at)gmail(dot)com, pgsql-bugs(at)postgresql(dot)org
Subject:	Re: BUG #12845: The GB18030 encoding doesn't support Unicode characters over 0xFFFF
Date:	2015-03-10 21:33:47
Message-ID:	54FF633B.9090006@iki.fi
Lists:	pgsql-bugs

On 03/09/2015 10:51 PM, a(dot)g(dot)nienhuis(at)gmail(dot)com wrote:
> The following bug has been logged on the website:
>
> Bug reference: 12845
> Logged by: Arjen Nienhuis
> Email address: a(dot)g(dot)nienhuis(at)gmail(dot)com
> PostgreSQL version: 9.3.5
> Operating system: Ubuntu Linux
> Description:
>
> Steps to reproduce:
>
> In psql:
>
> arjen=> select convert_to(chr(128512), 'GB18030');
>
> Actual output:
>
> ERROR: character with byte sequence 0xf0 0x9f 0x98 0x80 in encoding "UTF8"
> has no equivalent in encoding "GB18030"
>
> Expected output:
>
> convert_to
> ------------
> \x9439fc36
> (1 row)

Hmm, looks like our gb18030 <-> Unicode conversion table only contains
the Unicode BMP (Basic Multilingual Plane). Code points above 0xffff are
not included.

If we added all the missing mappings as one to one mappings, like we've
done for the BMP, that would bloat the table horribly. There are over 1
million code points that are currently not mapped. Fortunately, the
missing mappings are in linear ranges that would be fairly simple to
handle programmatically. See e.g.
https://ssl.icu-project.org/repos/icu/data/trunk/charset/source/gb18030/gb18030.html.
Someone needs to write the code (I'm not volunteering myself).

- Heikki
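
As a rough illustration of the "linear ranges" idea in the message above, the sketch below maps the supplementary code points (U+10000 through U+10FFFF) to four-byte GB18030 sequences with arithmetic instead of a table. It is written in Python for brevity and is not PostgreSQL's conversion code; the function names are hypothetical. It assumes the standard GB18030 layout, in which four-byte sequences have the byte ranges 0x81-0xFE, 0x30-0x39, 0x81-0xFE, 0x30-0x39, and U+10000 corresponds to the sequence 0x90 0x30 0x81 0x30.

```python
# Sketch only: algorithmic GB18030 mapping for code points above U+FFFF.
# Function names are illustrative and do not come from PostgreSQL.

FIRST_SUPPLEMENTARY = 0x10000
LAST_CODE_POINT = 0x10FFFF


def four_bytes_to_linear(seq: bytes) -> int:
    """Number a four-byte GB18030 sequence, counting from 0x81 0x30 0x81 0x30."""
    b1, b2, b3, b4 = seq
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


def linear_to_four_bytes(linear: int) -> bytes:
    """Inverse of four_bytes_to_linear."""
    linear, d4 = divmod(linear, 10)    # fourth byte: 10 values (0x30-0x39)
    linear, d3 = divmod(linear, 126)   # third byte: 126 values (0x81-0xFE)
    d1, d2 = divmod(linear, 10)        # second byte: 10 values (0x30-0x39)
    return bytes([0x81 + d1, 0x30 + d2, 0x81 + d3, 0x30 + d4])


# Linear index of the sequence assumed to represent U+10000.
BASE_LINEAR = four_bytes_to_linear(b"\x90\x30\x81\x30")


def supplementary_to_gb18030(code_point: int) -> bytes:
    """Encode a code point in U+10000..U+10FFFF as four GB18030 bytes."""
    if not FIRST_SUPPLEMENTARY <= code_point <= LAST_CODE_POINT:
        raise ValueError("code point is outside the supplementary planes")
    return linear_to_four_bytes(BASE_LINEAR + code_point - FIRST_SUPPLEMENTARY)


def gb18030_to_supplementary(seq: bytes) -> int:
    """Decode four GB18030 bytes back to a supplementary code point."""
    if len(seq) != 4:
        raise ValueError("expected exactly four bytes")
    code_point = four_bytes_to_linear(seq) - BASE_LINEAR + FIRST_SUPPLEMENTARY
    if not FIRST_SUPPLEMENTARY <= code_point <= LAST_CODE_POINT:
        raise ValueError("sequence is outside the supplementary range")
    return code_point


# The code point from the bug report, chr(128512):
print(supplementary_to_gb18030(128512).hex())  # prints 9439fc36
```

For the reported input this yields the bytes 0x94 0x39 0xfc 0x36, which is the expected output given in the bug report. A real fix would also have to validate byte ranges on input and cooperate with the existing table-driven lookup for the BMP, which this sketch leaves out.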

In response to

BUG #12845: The GB18030 encoding doesn't support Unicode characters over 0xFFFF at 2015-03-09 20:51:45 from a.g.nienhuis

Responses

Re: BUG #12845: The GB18030 encoding doesn't support Unicode characters over 0xFFFF at 2015-03-10 22:21:24 from Arjen Nienhuis
