Re: Trouble with UTF-8 data

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Janine Sisk <janine(at)furfly(dot)net>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Trouble with UTF-8 data
Date: 2008-01-17 23:38:50
Message-ID: 16915.1200613130@sss.pgh.pa.us
Lists: pgsql-general

Janine Sisk <janine(at)furfly(dot)net> writes:
> But I'm still getting this error when loading the data into the new
> database:

> ERROR: invalid byte sequence for encoding "UTF8": 0xeda7a1

The reason PG doesn't like this sequence is that it decodes to U+D9E1,
a Unicode surrogate code point (one half of a "surrogate pair"), which is
not supposed to ever appear in UTF-8 representation --- surrogate pairs are
a kluge that lets UTF-16 represent Unicode code points needing more than
16 bits. See

http://en.wikipedia.org/wiki/UTF-16
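
To see why 0xeda7a1 is rejected, the three bytes can be unpacked by hand.
The following is only a minimal Python sketch for illustration, not what
PG runs internally; the helper names decode_three_byte and is_surrogate
are made up for this example.

```python
def decode_three_byte(seq: bytes) -> int:
    """Unpack a 3-byte UTF-8-style sequence into a code point, with no range validation."""
    if len(seq) != 3 or (seq[0] & 0xF0) != 0xE0:
        raise ValueError("expected a 3-byte sequence with a 1110xxxx lead byte")
    if any((b & 0xC0) != 0x80 for b in seq[1:]):
        raise ValueError("continuation bytes must look like 10xxxxxx")
    # 4 payload bits from the lead byte, 6 from each continuation byte.
    return ((seq[0] & 0x0F) << 12) | ((seq[1] & 0x3F) << 6) | (seq[2] & 0x3F)


def is_surrogate(code_point: int) -> bool:
    """True for U+D800..U+DFFF, the range reserved for UTF-16 surrogates."""
    return 0xD800 <= code_point <= 0xDFFF


cp = decode_three_byte(bytes.fromhex("eda7a1"))
print(f"U+{cp:04X}", is_surrogate(cp))  # prints: U+D9E1 True
```

U+D9E1 falls in the high-surrogate half of the range (U+D800..U+DBFF),
so a strict UTF-8 decoder has to refuse it.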

I think you need a version of iconv that knows how to fold surrogate
pairs into proper UTF-8 form. It might also be that the data is
outright broken --- if this sequence (a high surrogate) isn't immediately
followed by a low-surrogate sequence then it isn't valid Unicode by
anybody's interpretation.
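
For illustration, here is a rough Python sketch of what "folding" means.
It is a stand-in for a suitably capable iconv, not a description of iconv
itself, and fold_surrogate_pairs is a hypothetical helper name. It assumes
the input is otherwise well-formed UTF-8 in which supplementary characters
were written as two 3-byte surrogate sequences.

```python
def fold_surrogate_pairs(data: bytes) -> bytes:
    """Re-encode UTF-8-encoded surrogate pairs as proper 4-byte UTF-8.

    Raises UnicodeDecodeError if a surrogate has no valid partner.
    """
    # Step 1: decode leniently, letting surrogate code points through.
    text = data.decode("utf-8", errors="surrogatepass")
    # Step 2: round-trip through UTF-16. The strict UTF-16 decoder combines
    # each high/low pair into one code point and rejects unpaired surrogates.
    folded = text.encode("utf-16-le", errors="surrogatepass").decode("utf-16-le")
    # Step 3: emit strict UTF-8.
    return folded.encode("utf-8")


# U+1F600 written as a surrogate pair (D83D, DE00), three bytes each.
paired = bytes.fromhex("eda0bdedb880")
print(fold_surrogate_pairs(paired).hex())  # prints: f09f9880

# The sequence from the error message, with nothing after it.
try:
    fold_surrogate_pairs(bytes.fromhex("eda7a1"))
except UnicodeDecodeError:
    print("unpaired surrogate: data is broken")
```

If the second case is what the dump actually contains, no conversion tool
can repair it automatically; the offending rows would need to be fixed or
dropped by hand.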

7.2.x unfortunately didn't check Unicode data carefully, and would
have let this data pass without comment ...

regards, tom lane
