Re: finding bogus UTF-8

From: dennis jenkins <dennis(dot)jenkins(dot)75(at)gmail(dot)com>
To: Scott Ribe <scott_ribe(at)elevated-dev(dot)com>
Cc: PostgreSQL general <pgsql-general(at)postgresql(dot)org>
Subject: Re: finding bogus UTF-8
Date: 2011-02-10 22:18:03
Message-ID: AANLkTikW0Yb8ZyULMVLgdLsv-A+6O_TF3rHmtdxX8mUG@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Thu, Feb 10, 2011 at 1:02 PM, Scott Ribe <scott_ribe(at)elevated-dev(dot)com> wrote:
> I know that I have at least one instance of a varchar that is not valid UTF-8, imported from a source with errors (AMA CPT files, actually) before PG's checking was as stringent as it is today. Can anybody suggest a query to find such values?

If you know which table and column the data is in, you can also do
something like this:

(I typed this up without checking the syntax of it. The basic idea is
to cast the column as bytea, encode with the 'escape' method, then
grep for back-slashes).

select * from bad_table where regexp_match (encode (bad_column::bytea,
'escape'), '\\\\'));

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Andy Colson 2011-02-10 22:18:15 Re: Schema version control
Previous Message dennis jenkins 2011-02-10 22:14:12 Re: finding bogus UTF-8