Re: How well does PostgreSQL 9.6.1 support unicode?

From: James Zhou <james(at)360data(dot)ca>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Vick Khera <vivek(at)khera(dot)org>, pgsql-general <pgsql-general(at)postgresql(dot)org>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Subject: Re: How well does PostgreSQL 9.6.1 support unicode?
Date: 2016-12-21 23:24:09
Message-ID: CAGuREpMnw99Xu52jKwfBSfjMWxby3x=CYZab=1F+vjOkY+q3Mw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

I figured out that I need to use the function CHR to enter supplementary
unicode characters (code points > FFFF, i.e. planes 1 - F), e.g.

insert into unicode(id, string) values(100, CHR(128120)); -- a mojo
character, https://unicodelookup.com/#0x1f478/1
insert into unicode(id, string) values(101, CHR(128121)); -- another mojo
insert into unicode(id, string) values(102, CHR(119071)); -- musical
symbol g clef ottava alta
insert into unicode(id, string) values(103, CHR(155648)); -- a very
infrequently used Chinese character
insert into unicode(id, string) values(104, CHR(155649)); -- another very
infrequently used Chinese character

the parameters are decimal representation of the code point values, e.g.
128120 is the decimal value of 1f478

The format U&'\03B1' only works for chars between 0000 - FFFF

When entered with CHR(), PostgreSQL gets their char_length() correctly, so
does substring() function.

Thank you all for help.

James

On Wed, Dec 21, 2016 at 8:31 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Vick Khera <vivek(at)khera(dot)org> writes:
> > On Wed, Dec 21, 2016 at 2:56 AM, Kyotaro HORIGUCHI <
> > horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> >> A PostgreSQL database with encoding=UTF8 just accepts the whole
> >> range of Unicode, regardless that a character is defined for the
> >> code or not.
>
> > Interesting... when I converted my application and database to utf8
> > encoding, I discovered that Postgres is picky about UTF-8. Specifically
> the
> > UTF-8 code point 0xed 0xa0 0x8d which maps to UNICODE code point 0xd80d.
> > This looks like a proper character but in fact is not a defined character
> > code point.
>
> Well, we're picky to the extent that RFC 3629 tells us to be picky:
> http://www.faqs.org/rfcs/rfc3629.html
>
> The case you mention is rejected because it would be half of a UTF16
> "surrogate pair", which should not be used in any Unicode representation
> other than UTF16; if we allowed it then there would be more than one way
> to represent the same Unicode code point, which is undesirable for a lot
> of reasons.
>
> > So I think when you present an actual string of UTF8 encoded characters,
> > Postgres does refuse characters unknown. However, as you observe,
> inserting
> > the unicode code point directly does not produce an error:
>
> > insert into unicode(id, string) values(1, U&'\d80d');
> > INSERT 0 1
>
> Hm. I think that's a bug. The lexer does know that \d80d is half of a
> surrogate pair, and it expects the second half to come next. If you
> supply something that isn't the second half of a surrogate pair, you
> get an error as expected:
>
> u8=# insert into unicode(id, string) values(1, U&'\d80dfoo');
> ERROR: invalid Unicode surrogate pair at or near "foo'"
> LINE 1: insert into unicode(id, string) values(1, U&'\d80dfoo');
> ^
>
> But it looks like if you just end the string after the first half of a
> surrogate, it just drops the character without complaint. Notice that
> what got inserted was a zero-length string, not U+D08D:
>
> u8=# select *, length(string) from unicode;
> id | string | length
> ----+--------+--------
> 1 | | 0
> (1 row)
>
> I'd have expected a syntax error along the line of "incomplete Unicode
> surrogate pair". Peter, I think this was your code to begin with ---
> was it intentional to not raise error here, or is that an oversight?
>
> regards, tom lane
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Rui Pacheco 2016-12-21 23:25:40 UTF-8 on Postgres wire protocol
Previous Message Adrian Klaver 2016-12-21 23:20:58 Re: Too long startup time after each crash.