Re: How to remove non-UTF values from a table?

From: Phoenix Kiula <phoenix(dot)kiula(at)gmail(dot)com>
To: Howard Cole <howardnews(at)selestial(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org List" <pgsql-general(at)postgresql(dot)org>
Subject: Re: How to remove non-UTF values from a table?
Date: 2009-12-16 01:13:29
Message-ID: e373d31e0912151713x4697a862j97d8f7d8d8a1c20a@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Tue, Dec 15, 2009 at 9:26 PM, Howard Cole <howardnews(at)selestial(dot)com> wrote:
> Phoenix Kiula wrote:
>>
>> An easy question for some I hope.
>>
>> I have a DB from 8.2 days that when I now dump and try to take into
>> the 8.3.7, it gives me errors about utf-8 stuff.
>>
>> I tried searching this list's archives but could not come up with an
>> answer.
>>
>> Google returns some sites like these:
>> http://sniptools.com/databases/finding-non-utf8-values-in-postgresql -
>> but I'm not clear on how to use them.
>>
>> Following the SQL on this site I could identify some columns that
>> contain text like this:
>>
>>    "Évolution générale de la situation démographique"
>>
>> So my guess is that the non-English characters were originally not
>> getting written in proper utf-8 variants.
>>
>> Is there any SQL possibility to find these columns and replace them
>> with utf-8 equivalents using some postgresql commands? Couldn't find
>> anything in the "Strings functions" (chapter 9 of manual).
>>
>> We're on CentOS.
>>
>> Thanks!
>>
>>
>
> My recommendation would be to install the iconv utility and run it on a
> plain text (pg_dump -Fp) backup as suggested in the google article - and
> then reimport the clean UTF-8.
>
> I am surprised that you managed to install the original backup on 8.3
> because it seems to be much more strict on encoding - Unless your database
> is not in UTF-8?

Thanks Howard.

I ran the SQL and it finds anything that has non-English characters.
For example:

http://www.amazon.co.jp/%E3%83%A4%E3%83%9E%E3%83%80%E9%9B%BB%E6%A9%9F%E3%81%AE%E5%93%81%E6%A0%BC%E2%80%95No-1%E4%BC%81%E6%A5%AD%E3%81%AE%E6%BF%80%E5%AE%89%E5%93%B2%E5%AD%A6-%E7%AB%8B%E7%9F%B3-%E6%B3%B0%E5%89%87/dp/406214378X/ref=sr_1_1?ie=UTF8&s=books&qid=1199212694&sr=8-1

Part of this URL is actually in Japanese, but when I paste it in this
email it comes up with all these percentage signs. I suppose this is
"url encoded".

Shouldn't this be valid UTF-8? How does PG calculate if something is
not valid UTF-8?

Thanks.

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Craig Ringer 2009-12-16 02:34:50 Re: Possible causes for database corruption and solutions
Previous Message Scott Marlowe 2009-12-16 01:07:48 Re: Possible causes for database corruption and solutions