Quick Links

Importing data - possible UTF8 import bug?

From:	"Mikel Lindsaar" <raasdnil(at)gmail(dot)com>
To:	pgsql-admin(at)postgresql(dot)org
Subject:	Importing data - possible UTF8 import bug?
Date:	2008-07-11 02:02:01
Message-ID:	57a815bf0807101902o193644ear50e211e67663d874@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-admin

Hi all,

I am importing a bunch of data exported from an Oracle database.

This came out as a flat file, tab delimited.

I have parsed this through a ruby script that does the following:

For each line
split it at the tab
for each column
parse it through Iconv and convert from UTF8 to UTF8
on failure dump the line into a log for inspection and continue
check to see if the column value ends in a single \
if so, add another \ to escape it and not the tab that will follow
check to see if the column value is blank
if so, replace with \N
join the array back together with tabs
write it out to the import file.

This works quite well. The ruby script handles about 10,000 lines
every 15 seconds or so. Theoretically no invalid utf-8 sequences
could end up in the output file as everything is passed through Iconv
and if it raises an except it dumps the offending line into a separate
file for manual handling.

But I am hitting a problem. When I try to import the result with COPY
I am getting:

ERROR: invalid byte sequence for encoding "UTF8": 0xa2

And gives me a line reference.

I then grabbed that line out of the file, unpacked each byte into a
decimal representation and 162 does not appear anywhere.

Additionally I did:

Iconv.iconv('UTF-8', 'UTF-8', line).join == line

Which takes the line and converts it from UTF8 to UTF8 using the Iconv
library (--f UTF-8 -t UTF-8) and this returns true.

Interestingly, the following also returns true:

Iconv.iconv('ASCII', 'UTF-8', line).join == line #=> true

So there can't be any non ASCII characters in there.

Unpacking the line and uniq'ing it and sorting it gives the following:

@result.unpack('U'*(at)result(dot)length).uniq.sort
=> [9, 10, 32, 45, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 65, 66,
67, 68, 69, 70, 72, 73, 74, 76, 77, 78, 79, 80, 82, 83, 84, 85, 87,
91, 92, 93, 97, 98, 99, 101, 109, 110, 111, 114, 116, 117, 121]

Anyone got any ideas how I can hunt down what is in this row, visually
looking at it hasn't helped?

Mikel

--
http://lindsaar.net/
Rails, RSpec and Life blog....

Responses

Re: Importing data - possible UTF8 import bug? at 2008-07-11 02:56:47 from Mikel Lindsaar
Re: Importing data - possible UTF8 import bug? at 2008-07-11 03:32:38 from Tom Lane
Re: Importing data - possible UTF8 import bug? at 2008-07-11 08:37:39 from Mikel Lindsaar

Browse pgsql-admin by date

	From	Date	Subject
Next Message	Mikel Lindsaar	2008-07-11 02:56:47	Re: Importing data - possible UTF8 import bug?
Previous Message	Tom Lane	2008-07-10 16:21:44	Re: query optimization