Re: Differences in UTF8 between 8.0 and 8.1

From:	Paul Lindner <lindner(at)inuus(dot)com>
To:	andrew(at)supernews(dot)com
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Differences in UTF8 between 8.0 and 8.1
Date:	2005-10-27 00:59:51
Message-ID:	20051027005951.GA27655@inuus.com
Lists:	pgsql-hackers

On Mon, Oct 24, 2005 at 05:07:40AM -0000, Andrew - Supernews wrote:
>
> I'm inclined to suspect that the whole sequence c1 f9 d4 c2 d0 c7 d2 b9
> was never actually a valid utf-8 string, and that the d2 b9 is only valid
> by coincidence (it's a Cyrillic letter from Azerbaijani). I know the 8.0
> utf-8 check was broken, but I didn't realize it was quite so bad.

Looking at the data, it appears to be a sequence of Latin-1
characters. They all have the eighth bit set, and all seem to pass the
8.0 check.
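
For illustration, here is a minimal Python sketch of that observation, using the byte sequence quoted above. It relies on Python's own strict UTF-8 decoder, which is an assumption on my part and not necessarily identical to the check in 8.1; the helper name is_valid_utf8 is made up for this example.

```python
# Byte sequence quoted earlier in the thread.
raw = bytes.fromhex("c1f9d4c2d0c7d2b9")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


whole_ok = is_valid_utf8(raw)           # the full 8-byte sequence
tail_ok = is_valid_utf8(raw[-2:])       # only the trailing d2 b9
all_high = all(b & 0x80 for b in raw)   # eighth bit set on every byte
as_latin1 = raw.decode("latin-1")       # Latin-1 maps every byte to a character

print(whole_ok, tail_ok, all_high, len(as_latin1))
```

Under these assumptions the full sequence is rejected while d2 b9 on its own decodes to a single code point (U+04B9), which fits the "valid by coincidence" reading.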

In a million rows I found 2 examples of this.

However, I'm running into another problem now. The command:

iconv -c -f UTF8 -t UTF8

does strip out the invalid characters. However, iconv reads the
entire file into memory before it writes out any data. This is not so
good for multi-gigabyte dump files and doesn't allow for it to be used
in a pipe between pg_dump and psql.

Does anyone have any other recommendations? GNU recode might do it, but
I'm a bit stymied by the syntax. A quick Perl script using
Text::Iconv didn't work either. I'm off to look at some other Perl
modules and will try to create a script so I can strip out the invalid
characters.
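
As a rough sketch of what such a filter could look like (in Python rather than Perl, and not something tested against real dump files), the following reads its input in fixed-size chunks and drops any bytes that Python's UTF-8 decoder rejects. The function name strip_invalid_utf8 and the chunk size are assumptions made for this example. Python's notion of valid UTF-8 may differ in edge cases from what 8.1 accepts.

```python
import codecs
import sys

CHUNK_SIZE = 64 * 1024  # arbitrary choice; bounds how much is held in memory


def strip_invalid_utf8(src, dst, chunk_size=CHUNK_SIZE):
    """Copy binary stream src to dst, dropping bytes that are not valid UTF-8."""
    # The incremental decoder buffers a multibyte sequence that is split
    # across two chunks, so chunk boundaries do not cause false rejections.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="ignore")
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(decoder.decode(chunk).encode("utf-8"))
    # Flush the decoder; a truncated sequence at end of input is dropped.
    dst.write(decoder.decode(b"", final=True).encode("utf-8"))


if __name__ == "__main__":
    strip_invalid_utf8(sys.stdin.buffer, sys.stdout.buffer)
```

Saved under a hypothetical name such as strip_utf8.py, it could sit in a pipe along the lines of "pg_dump olddb | python3 strip_utf8.py | psql newdb", where olddb and newdb are placeholder database names. Note that this silently discards the offending bytes, much like iconv -c, so it loses data instead of repairing it.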

--
Paul Lindner ||||| | | | | | | | | |
lindner(at)inuus(dot)com

In response to

Re: Differences in UTF8 between 8.0 and 8.1 at 2005-10-24 05:07:40 from Andrew - Supernews

Responses

Re: Differences in UTF8 between 8.0 and 8.1 at 2005-10-27 01:40:20 from Andrej Ricnik-Bay
Re: Differences in UTF8 between 8.0 and 8.1 at 2005-10-27 01:49:48 from Christopher Kings-Lynne
Re: Differences in UTF8 between 8.0 and 8.1 at 2005-10-27 11:56:02 from Andrew - Supernews
