Re: BUG #5532: Valid UTF8 sequence errors as invalid

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Mike Lewis <mikelikespie(at)gmail(dot)com>
Cc:	pgsql-bugs(at)postgresql(dot)org
Subject:	Re: BUG #5532: Valid UTF8 sequence errors as invalid
Date:	2010-06-30 18:21:33
Message-ID:	14170.1277922093@sss.pgh.pa.us
Lists:	pgsql-bugs

Mike Lewis <mikelikespie(at)gmail(dot)com> writes:
> I've run into a fair amount of unicode errors when trying to copy in log
> files. Would you recommend using bytea or another data type instead of text
> or varchar... or at least copying to a staging table with bytea's and
> filtering out invalid rows when moving it to the main table?

My guess is that you're working with data that was originally
represented in UTF16, and you've used a tool that doesn't really know
what it's doing to convert to UTF8. A correct conversion has to reunite
surrogate pairs into wider-than-16-bit Unicode characters and then
encode those as single UTF8 sequences. Dunno if you can easily identify
the culprit, but fixing that conversion is the long-term solution.
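
As a rough illustration of what "reuniting surrogate pairs" means, here is a minimal Python sketch. It assumes the bad input encodes each UTF16 surrogate as its own three-byte sequence (the layout often called CESU-8); the function name repair_cesu8 is hypothetical and this is not something taken from PostgreSQL or iconv.

```python
def repair_cesu8(raw: bytes) -> bytes:
    """Re-encode bytes whose surrogates were each written as a 3-byte sequence.

    Assumption: the input is otherwise well-formed UTF8. Unpaired surrogates
    cannot be repaired and raise UnicodeDecodeError.
    """
    # Decode while letting individually encoded surrogates through as
    # lone surrogate code points in the resulting str.
    text = raw.decode("utf-8", errors="surrogatepass")
    # Round-trip through UTF16: writing the surrogates out as 16-bit units
    # and decoding strictly joins each high/low pair into one character.
    utf16 = text.encode("utf-16-le", errors="surrogatepass")
    repaired = utf16.decode("utf-16-le")
    # Each wider-than-16-bit character now becomes a single 4-byte sequence.
    return repaired.encode("utf-8")


if __name__ == "__main__":
    # U+1F600 miscoded as two 3-byte sequences (surrogates D83D and DE00).
    bad = b"log \xed\xa0\xbd\xed\xb8\x80"
    print(repair_cesu8(bad))
```

Input that is already correct UTF8 passes through unchanged, so a filter like this could be run over a whole log file before COPY rather than only over the rows that failed.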

(BTW, I should think that iconv or some related tool would have a
solution for fixing this miscoding; it's not an uncommon problem.)

regards, tom lane

In response to

Re: BUG #5532: Valid UTF8 sequence errors as invalid at 2010-06-30 18:05:24 from Mike Lewis

Responses

Re: BUG #5532: Valid UTF8 sequence errors as invalid at 2010-07-06 08:16:41 from Dimitri Fontaine
