Re: finding bogus UTF-8

From: dennis jenkins <dennis(dot)jenkins(dot)75(at)gmail(dot)com>
To: PostgreSQL general <pgsql-general(at)postgresql(dot)org>
Subject: Re: finding bogus UTF-8
Date: 2011-02-10 22:14:12
Message-ID: AANLkTi=VwFEMRst2Wpw-ZM-1Qr0+jiYtL+ETSO1ueUC7@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

I'm working on a project to convert a large database form SQL_ASCII to
UTF-8. I am using this procedure:

1) pg_dump the SQL_ASCII database to an SQL text file.
2) Run through a small (efficient) C program that logs each line that
contains ANY "unclean" ASCII text.
3) Parse that log with a small perl program (hashes are easier in perl
than C) to produce a report, and emit some SQL.
4) Construct SQL update statements to "repair" the original data.
5) Repeat at step #1 until the database is clean.
6) pg_dump (SQL_ASCII) -> pg_restore -EUTF8 new database.
7) Profit!

If you are interested, I can email to you the C and Perl source.

It runs like this:

# time pg_restore /db-dumps/some_ascii_pgdump.bin | ./ascii-tester |
./bad-ascii-report.pl > unclean-ascii.rpt

real 11m11.804s
user 18m2.579s
sys 2m25.803s

# grep "^--" unclean-ascii.rpt
-- some_table 4051021
-- other_table 16

^^^ Numbers are count of rows that need cleaning.
Entire "rpt" file contains SQL comments "--" and SQL select statements
of the form:
select * from table where primary_key in (1, 2, 3, 4, ....);

The perl script contains a hash that maps table names to primary key
column IDs (to pick up when parsing the raw SQL restore "COPY"
script). I will need to purge my secret schema stuff from it before
sharing it with anyone.

My solution is probably not perfect, and probably not optimal, but it
is working great so far. I'm almost done cleaning up my database and
hope to attempt a real UTF8 restore in the near future.

On Thu, Feb 10, 2011 at 1:02 PM, Scott Ribe <scott_ribe(at)elevated-dev(dot)com> wrote:
> I know that I have at least one instance of a varchar that is not valid UTF-8, imported from a source with errors (AMA CPT files, actually) before PG's checking was as stringent as it is today. Can anybody suggest a query to find such values?
>
>
> --
> Scott Ribe
> scott_ribe(at)elevated-dev(dot)com
> http://www.elevated-dev.com/
> (303) 722-0567 voice
>
>
>
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message dennis jenkins 2011-02-10 22:18:03 Re: finding bogus UTF-8
Previous Message Andy Colson 2011-02-10 22:14:06 Re: Schema version control