Re: invalid byte sequence for encoding "UNICODE": 0xd9

From: Eric Walstad <eric(at)ericwalstad(dot)com>
To: sfpug(at)postgresql(dot)org
Subject: Re: invalid byte sequence for encoding "UNICODE": 0xd9
Date: 2006-02-14 21:31:42
Message-ID: 200602141331.43643.eric@ericwalstad.com
Lists: sfpug

On Monday 13 February 2006 21:29, David Fetter wrote:
> On Mon, Feb 13, 2006 at 02:37:45PM -0800, Eric Walstad wrote:
> > Hi everyone,
> >
> > Question: How do I keep from receiving the subject error message when
> > loading data?
>
> I suspect you'll have to pass iconv over the dump file, as mentioned
> in the release notes. You may have had the database encoded in that
> abomination hiding under the mask of SQL_ASCII, which isn't really an
> encoding. It's more like "any byte string without a null byte in it" :P
>
> HTH :)
>
> Cheers,
> D

Thanks for pointing me in the right direction, David.

I found the relevant section of the release notes here:
<http://www.postgresql.org/docs/current/interactive/release-8-1.html#AEN72739>

I first split my big dump file into manageable chunks:

mkdir tmp
cd tmp
split -C 25000000 ../output.sql
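
Since the pieces get concatenated back together later, it may be worth confirming that they reproduce the original dump before editing any of them. A minimal sketch follows; the function name verify_split is hypothetical, and it assumes the shell glob returns the pieces in the order split wrote them (xaa, xab, ...):

```shell
# verify_split ORIGINAL PIECE...
# Succeeds (exit 0) only if the pieces, concatenated in the order given,
# are byte-for-byte identical to ORIGINAL.
verify_split() {
    original=$1
    shift
    # cmp -s is silent; "-" makes it read the concatenation from stdin.
    cat "$@" | cmp -s - "$original"
}
```

From inside tmp this would be called as: verify_split ../output.sql xa* && echo "pieces match"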

Then I ran iconv on all the split files, using the command line suggested in
the release notes:

for SPLIT_FILE in xa*
do
    iconv -f UTF-8 -t UTF-8 "$SPLIT_FILE" >> converted.sql
done

That, unfortunately, removed some other important bits of data (tabs, I think,
next to the invalid Unicode characters). However, iconv did print messages
when it encountered the invalid characters (with byte offsets, I think), which
told me where the problems were located and in which split files. I was then
able to go into each split file, delete the characters by hand with vim,
cat all the split files back together, and load all the data successfully.
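
For anyone repeating this, the find-then-fix steps could be scripted roughly as below. This is only a sketch under assumptions: the function names are made up for illustration, and it relies on iconv exiting non-zero when it meets an invalid sequence. The -c option (which, as far as I can tell, is part of the command line in the release notes but is missing from my loop above) tells iconv to drop invalid sequences rather than stop at the first one, which may be why data after the bad bytes went missing for me.

```shell
# find_invalid_utf8 FILE...
# Print the name of each file that is not valid UTF-8. Nothing is modified.
find_invalid_utf8() {
    for f in "$@"; do
        # Without -c, iconv fails on the first invalid input sequence.
        if ! iconv -f UTF-8 -t UTF-8 "$f" > /dev/null 2>&1; then
            printf '%s\n' "$f"
        fi
    done
}

# clean_utf8 INPUT OUTPUT
# Write a copy of INPUT with invalid byte sequences dropped (-c).
# Some iconv builds still exit non-zero after discarding bytes, so the
# exit status here is not a reliable success indicator.
clean_utf8() {
    iconv -c -f UTF-8 -t UTF-8 "$1" > "$2"
}
```

Running find_invalid_utf8 xa* first narrows down which pieces need attention; comparing a piece against its cleaned copy (with cmp or diff) then shows exactly what -c would throw away before you commit to it.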

My postgresql.conf file has the encoding line commented out:

#client_encoding = sql_ascii # actually, defaults to database encoding

No database encoding was specified when I created the database with createdb.
I suspect that means 'sql_ascii' was used, but I didn't find where the
default database encoding is specified, so I don't know for sure.
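
One way to settle the question would be to ask the server directly. The sketch below wraps psql in a small function; the name db_encoding and the database name in the usage line are placeholders, and it assumes psql can connect with default credentials. My understanding (not verified here) is that a database created without an explicit encoding inherits it from the template database, whose encoding was fixed when initdb was run.

```shell
# db_encoding DBNAME
# Print the encoding of the named database, e.g. SQL_ASCII.
db_encoding() {
    # -A: unaligned output, -t: rows only, so just the bare value is printed.
    psql -At -d "$1" -c 'SHOW server_encoding;'
}
```

Usage would look like: db_encoding mydb. Running psql -l also lists every database together with its encoding, and createdb accepts -E to choose an encoding explicitly instead of relying on the default.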

Thanks again,

Eric.
