invalid byte sequence on restore

From: "Markus Wollny" <Markus(dot)Wollny(at)computec(dot)de>
To: <pgsql-general(at)postgresql(dot)org>
Subject: invalid byte sequence on restore
Date: 2008-11-14 12:58:04
Message-ID: 28011CD60FB1724DBA4442E38277F6260AFCD334@hermes.computec.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi!

I am currently struggling with a couple oif tainted bytes in one of our PostgreSQL 8.2 databases which I plan to move to 8.3 soon - so I need to dump & restore.

I think this problem bit me almost every single time during a major upgrade in the last couple of years, so I must say that I have become somewhat used to the procedure. Before now I have always used the plain text format dump, which I fed through iconv in order to correct encoding errors. This time I also had to convert from the 8.2 contrib-tsearch2 to 8.3 core-tsearch2, using the helpful instructions found at http://sql-info.de/postgresql/notes/converting-tsearch2-to-8.3.html

Unfortunately this involves using the custom dump format and pg_restore. Using iconv on a custom dump is most probably not such a good idea :)

On restoring, I received errors like the following:

pg_restore: [archiver (db)] COPY failed: ERROR: invalid byte sequence for encoding "UTF8": 0x80
pg_restore: [archiver (db)] COPY failed: ERROR: invalid byte sequence for encoding "UTF8": 0xcd09

Fortunately I'm restoring to a test server, so the original DB is still available to me - and so I decided to correct the encoding errors before dumping. I just had to find out what to correct - and that's the tricky bit. pg_restore will error out on the first occurrence of an invalid byte sequence for a table and so the table remains empty on restore and what's more: Even when you would find this one spot per chance from the little information you've got now and you'd correct it before dumping again, you'd never know if a similar issue wouldn't bite you a few lines further down in the dump on the next attempt. So it's better to sieve through the complete contents of the affected tables before attempting another restore.

Here's a possible path of actions to resolve the issue:

1. Take a look in the PostgreSQL logfile of the server you restore to in order to determine the tables where the error occurs. For each affected table, you'll find a couple of lines like the following:
somedb>ERROR: invalid byte sequence for encoding "UTF8": 0x80
somedb>HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
somedb>CONTEXT: COPY topsearchterms, line 8998
somedb>STATEMENT: COPY topsearchterms (searchterm, usercount) FROM stdin;
In this case, a table named topsearchterms is affected. Unfortunately you don't get to know which schema, but that doesn't matter right now.

2. Now get a list of all the objects in your custom dump:
# pg_restore --disable-triggers -U postgres -Fc somedb-data.pg -l>somedb_objects.txt

3. In somedb_objects.txt, comment out everything but the lines for the tables where the errors occur; be sure to keep all tables with matching names and table column definitions in there, no matter which schema.

4. Restore those tables' contents into a file:
# pg_restore --disable-triggers -U postgres -Fc somedb-data.pg -L somedb_objects.txt>broken_tables.txt

5. Now filter the tables' contents through iconv
# cat broken_tables.txt | ./iconv-chunks - -c -f utf8 -t utf8 | fixed_tables.txt
As my databases are quite big, I always use this helpful script here: http://maurice.aubrey.googlepages.com/iconv-chunks.txt - this is feeding the input in chunks to iconv, thus avoiding memory exhaustion.

6. Now you can simply use diff to find the affected tuples:
# diff broken_tables.txt fixed_tables.txt

7. Even when diff output is not enough in itself, it will give you the line numbers, where the error occurs. So fire up your favorite editor and examine these lines in broken_tables.txt.

8. Update your affected tables in your original database.

9. Dump & reload again - this time it'll hoepfully run smoothly :)

I hope that this may help somebody facing the same problem. I'd also welcome any suggestions on how to improve on this procedure.

Kind regards

Markus

Computec Media AG
Sitz der Gesellschaft und Registergericht: Fürth (HRB 8818)
Vorstandsmitglieder: Johannes S. Gözalan (Vorsitzender) und Rainer Rosenbusch
Vorsitzender des Aufsichtsrates: Jürg Marquard
Umsatzsteuer-Identifikationsnummer: DE 812 575 276

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Martijn van Oosterhout 2008-11-14 13:03:22 Re: invalid byte sequence on restore
Previous Message serge.luc-cayol 2008-11-14 12:56:34 trouble in installing postgreSQL 8.3