Re: v7.4 pg_dump(all) need to encode from SQL_ASCII to UTF8

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Ralph Smith <smithrn(at)washington(dot)edu>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: v7.4 pg_dump(all) need to encode from SQL_ASCII to UTF8
Date: 2008-02-26 23:24:08
Message-ID: 9802.1204068248@sss.pgh.pa.us
Lists: pgsql-general

Ralph Smith <smithrn(at)washington(dot)edu> writes:
> I'm not sure if you're saying I should ignore these errors...

No, not at all.

> I'm using dumps from DB airburst.

Doesn't look like that --- you have

>     Name    |  Owner   | Encoding
> ------------+----------+-----------
>  airburst   | root     | SQL_ASCII

but the dump contains

> SET client_encoding = 'UTF8';

which indicates that it came from a database that claimed to have UTF8
encoding. (Hmm ... although it's just barely possible that you have
PGCLIENTENCODING set in pg_dump's environment?)

> psql:./table_board_posts.sql:248: ERROR: invalid byte sequence for
> encoding "UTF8": 0x91

In any case, this failure is pretty strong evidence that what is in the
dump is actually *not* UTF8 data, or at least not all of it is. (I'd bet
on this particular value being in some LATINn encoding.) What you're
going to need to do is figure out exactly what encoding the data really
has. If you're lucky and it's all the same encoding, you can adjust it
to UTF8 by running the dump file through iconv, or just edit the SET
client_encoding command in the dump to match the true encoding (then
PG will take care of converting it to UTF8 during the load).
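
For the single-encoding case, the conversion step can be sketched in Python
rather than iconv. This is only an illustration under stated assumptions: the
function name reencode_dump is hypothetical, and the default source encoding
of cp1252 (Windows-1252) is a guess, chosen because byte 0x91 is a left single
quotation mark there. The real encoding still has to be determined from the
data.

```python
def reencode_dump(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode dump bytes from source_encoding and return them as UTF-8.

    source_encoding is an ASSUMPTION about the data; cp1252 is only a
    plausible guess for a dump that contains byte 0x91.
    """
    # Strict decoding (the default) raises UnicodeDecodeError on any byte
    # that is undefined in source_encoding instead of silently guessing.
    text = raw.decode(source_encoding)
    return text.encode("utf-8")
```

Reading the dump file in binary mode, passing its contents through
reencode_dump, and writing the result to a new file leaves the existing
SET client_encoding = 'UTF8' line truthful. A UnicodeDecodeError here is a
hint that the guessed encoding is wrong or that the data is mixed.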

If you're not lucky, you have a mishmash of differently encoded data,
and I'm afraid you're in for some unpleasant tedium getting it all into
one encoding.
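
Either way, a reasonable first step is to locate the lines in the dump that
are not valid UTF-8, so they can be inspected one at a time. The following is
a minimal sketch, not a complete tool; the function name
find_invalid_utf8_lines is hypothetical. Splitting on newline bytes is safe
because a 0x0A byte never occurs inside a multi-byte UTF-8 sequence.

```python
def find_invalid_utf8_lines(raw: bytes):
    """Return (line_number, offending_bytes_hex, line_bytes) for every
    line of raw that fails to decode as UTF-8."""
    problems = []
    for lineno, line in enumerate(raw.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as exc:
            # exc.start and exc.end delimit the first undecodable bytes.
            bad = line[exc.start:exc.end]
            problems.append((lineno, bad.hex(), line))
    return problems
```

Only the first offending byte sequence on each line is reported. If the
reported bytes all make sense under one encoding, the data is probably
uniform; if not, the affected rows need to be sorted out case by case.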

The reason you're suffering this pain is that 7.x was not very good
about checking or enforcing encoding validity. Current PG is much
stricter; cleaning up the data will cost you some pain now but it'll be
a good investment in the long run.

Alternatively, if you don't particularly *care* about encoding issues
and feel that everything was working fine before, you can create your
new DB with SQL_ASCII encoding (which actually means "no known
encoding") and PG will be just as lax as it was before. But if you want
to say that the database uses UTF8 encoding, you need to present validly
encoded data.

regards, tom lane
