Re: Bug with UTF-8 character

From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Marko Kreen <markokr(at)gmail(dot)com>
Cc: Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, pgsql-hackers(at)postgresql(dot)org, eg(at)cybertec(dot)at
Subject: Re: Bug with UTF-8 character
Date: 2006-05-26 14:37:25
Message-ID: 20060526143725.GE27513@svana.org
Lists: pgsql-hackers

On Fri, May 26, 2006 at 05:16:59PM +0300, Marko Kreen wrote:
> On 5/26/06, Martijn van Oosterhout <kleptog(at)svana(dot)org> wrote:
> >On Fri, May 26, 2006 at 08:21:56AM +0200, Hans-Jürgen Schönig wrote:
> >> I got a bug request for the following unicode character in PostgreSQL
> >> 8.1.4: 0xedaeb8
> >>
> >> ERROR: invalid byte sequence for encoding "UTF8": 0xedaeb8
>
> >Your character converts to char DBB8. According to the standard,
> >characters in the range D800-DFFF are not characters but surrogates.
> >They don't mean anything by themselves and are thus rejected by
> >postgres.
> >
> >http://www.unicode.org/faq/utf_bom.html#30
> >
> >This character should be preceded by a low surrogate (D800-DBFF). You
> >should combine the two into a single 4-byte UTF-8 character.
>
> You are talking about UTF16, not UTF8.

UTF-8 and UTF-16 use the same character set as their base; only the
encoding is different.

As that page says, to convert the surrogate pair in UTF-16 (D800 DC00)
to UTF-8, you have to combine them into a single 4-byte UTF-8
character. The direct encoding for D800 into UTF-8 is invalid because
no such character exists.
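
A minimal sketch of that combination in Python, for illustration only. The
function name combine_surrogates is made up here; the arithmetic is the
standard UTF-16 surrogate formula, and the encoder is Python's built-in
strict UTF-8 codec, not anything from the postgres source.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into one Unicode code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("first unit must be a high surrogate (D800-DBFF)")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("second unit must be a low surrogate (DC00-DFFF)")
    # Each surrogate carries 10 bits; the pair starts at U+10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The pair from the text above: D800 DC00.
code_point = combine_surrogates(0xD800, 0xDC00)
encoded = chr(code_point).encode("utf-8")
print(hex(code_point), encoded.hex())  # 0x10000 f0908080

# Encoding the lone surrogate D800 directly is refused by a strict encoder.
try:
    chr(0xD800).encode("utf-8")
    lone_surrogate_rejected = False
except UnicodeEncodeError:
    lone_surrogate_rejected = True
print(lone_surrogate_rejected)
```

The result is a single 4-byte sequence, which is what a correct UTF-16 to
UTF-8 converter should emit for the pair.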

The OP apparently has some broken UTF-16 to UTF-8 conversion software
and thus produced invalid UTF-8, which postgres is rejecting. Given he
didn't post the other half of the surrogate, we don't actually know
what character he's trying to represent, so we can't help him with the
encoding. However, supplementary characters (which require surrogates
in UTF-16) are all in the range 0x10000 to 0x10FFFF.
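
To see why the reported bytes are rejected, here is a rough sketch that
unpacks 0xedaeb8 by hand using the 3-byte UTF-8 bit layout. The helper name
decode_three_byte is hypothetical, and the strict check uses Python's own
decoder as a stand-in for what postgres does.

```python
def decode_three_byte(seq: bytes) -> int:
    """Unpack a 3-byte UTF-8 pattern (1110xxxx 10xxxxxx 10xxxxxx) without validation."""
    b0, b1, b2 = seq
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


raw = bytes.fromhex("edaeb8")  # the bytes from the error message
value = decode_three_byte(raw)
is_surrogate = 0xD800 <= value <= 0xDFFF
print(hex(value), is_surrogate)  # 0xdbb8 True

# A strict UTF-8 decoder refuses the same bytes.
try:
    raw.decode("utf-8")
    strict_decoder_accepts = True
except UnicodeDecodeError:
    strict_decoder_accepts = False
print(strict_decoder_accepts)
```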

If you don't believe me, check the Unicode database yourself (warning,
it's large: 944KB).
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

DBB8 is a private use surrogate; maybe he should be using something in
the range E000-F8FF, which are normal private use characters.
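
As a quick check of that distinction, the sketch below asks Python's
unicodedata module for the general category of each value ("Cs" is
surrogate, "Co" is private use) and encodes the ends of the E000-F8FF
range. It only illustrates the ranges mentioned above; the variable names
are mine.

```python
import unicodedata

# General category of the problem value versus a normal private use character.
surrogate_category = unicodedata.category(chr(0xDBB8))
private_use_category = unicodedata.category(chr(0xE000))
print(surrogate_category, private_use_category)  # Cs Co

# Private use characters in E000-F8FF encode as ordinary 3-byte UTF-8.
first_private = chr(0xE000).encode("utf-8")
last_private = chr(0xF8FF).encode("utf-8")
print(first_private.hex(), last_private.hex())  # ee8080 efa3bf
```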

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
