Re: How well does PostgreSQL 9.6.1 support unicode?

From: Vick Khera <vivek(at)khera(dot)org>
To: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: How well does PostgreSQL 9.6.1 support unicode?
Date: 2016-12-21 14:08:31
Message-ID: CALd+dcfA2-p2CquiokLPxQKWzFP-ggtQ7uqcab3ozYsdajkGAQ@mail.gmail.com
Lists: pgsql-general

On Wed, Dec 21, 2016 at 2:56 AM, Kyotaro HORIGUCHI <
horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> A PostgreSQL database with encoding=UTF8 just accepts the whole
> range of Unicode, regardless that a character is defined for the
> code or not.

Interesting... when I converted my application and database to UTF-8
encoding, I discovered that Postgres is picky about UTF-8. Specifically, it
rejects the byte sequence 0xed 0xa0 0x8d, which decodes to Unicode code
point U+D80D. This looks like a proper character, but U+D80D falls in the
surrogate range (U+D800 to U+DFFF), which is reserved for UTF-16 and is not
a valid character in UTF-8.
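
For anyone who wants to check the arithmetic, here is a rough Python sketch
(not anything Postgres itself runs) that unpacks a three-byte sequence by
hand and tests the result against the surrogate range. The function names
are made up for illustration.

```python
def decode_three_byte_utf8(seq: bytes) -> int:
    """Unpack a 1110xxxx 10xxxxxx 10xxxxxx sequence into a code point.

    This only undoes the bit packing; it does not check whether the
    resulting code point is allowed in UTF-8.
    """
    if len(seq) != 3:
        raise ValueError("expected exactly three bytes")
    lead, mid, last = seq
    if lead & 0xF0 != 0xE0 or mid & 0xC0 != 0x80 or last & 0xC0 != 0x80:
        raise ValueError("not a three-byte UTF-8 pattern")
    return ((lead & 0x0F) << 12) | ((mid & 0x3F) << 6) | (last & 0x3F)


def is_surrogate(code_point: int) -> bool:
    """True for U+D800..U+DFFF, which UTF-8 is not allowed to encode."""
    return 0xD800 <= code_point <= 0xDFFF


cp = decode_three_byte_utf8(b"\xed\xa0\x8d")
print(f"U+{cp:04X}", is_surrogate(cp))  # prints: U+D80D True
```

Python's own strict decoder also refuses these bytes:
b"\xed\xa0\x8d".decode("utf-8") raises UnicodeDecodeError.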

Given the above unicode table:

insert into unicode(id, string) values(1, E'\xed\xa0\x8d');
ERROR: invalid byte sequence for encoding "UTF8": 0xed 0xa0 0x8d

So I think when you present an actual string of UTF-8 encoded bytes,
Postgres does reject sequences that are not valid UTF-8. However, as you
observe, inserting the Unicode code point directly does not produce an error:

insert into unicode(id, string) values(1, U&'\d80d');
INSERT 0 1

I discovered this when that specific byte sequence was found in my database
during the conversion. I have no idea what my customer entered in the form
to make that sequence, but it was part of the Vietnamese spelling of Ho Chi
Minh City as best I could figure.
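
If it helps anyone doing a similar conversion, below is a minimal Python
sketch of how one might scan exported data for byte spans that a strict
UTF-8 decoder rejects before loading it. It relies on Python's decoder, not
Postgres's validator, so the two may not agree in every case; the function
name and the sample bytes are hypothetical.

```python
def find_invalid_utf8(data: bytes):
    """Return (offset, offending_bytes) for each span strict UTF-8 rejects."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            # Decode the remainder; success means nothing else is invalid.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice we decoded.
            start = pos + exc.start
            end = pos + exc.end
            problems.append((start, data[start:end]))
            pos = end  # resume scanning just past the rejected span
    return problems


# Hypothetical sample: ASCII text with the surrogate bytes embedded.
sample = b"Ho Ch\xed\xa0\x8d Minh"
for offset, raw in find_invalid_utf8(sample):
    print(offset, raw.hex())
```

Re-decoding the tail after every error is quadratic in the worst case,
which is fine for spot checks but not for very large dumps.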
