
Re: error while trying to change the database encoding on a database

From:	Adrian Klaver <adrian(dot)klaver(at)gmail(dot)com>
To:	Geoffrey Myers <lists(at)serioustechnology(dot)com>
Cc:	pgsql-general(at)postgresql(dot)org
Subject:	Re: error while trying to change the database encoding on a database
Date:	2011-01-24 20:44:58
Message-ID:	4D3DE4CA.3080706@gmail.com
Lists:	pgsql-general

On 01/24/2011 10:57 AM, Geoffrey Myers wrote:
> Adrian Klaver wrote:
>> On 01/24/2011 09:16 AM, Geoffrey Myers wrote:
>>
>>>
>>> We hope to identify the characters and fix them in the existing
>>> database, then convert. It appears to be very limited, but it would help
>>> if there was some way to identify these characters outside of simply
>>> doing the reload of the data and finding the errors.
>>>
>>> Hence the reason I asked about a resource that might identify the
>>> characters.
>>
>> The problem is that from the standpoint of the SQL_ASCII database
>> there is nothing wrong with the characters per se. AFAIK there is no
>> built in function to validate characters. The reason is that valid is
>> determined by the encoding and if you know the encoding then you
>> really don't need to determine validity. If you want to see one way
>> others have tackled this, search on iconv in the mailing list archive.
>> This requires working on an external copy of the data and knowing
>> something about the encodings involved. The nearest I could ever find
>> to an encoding detector is:
>>
>> http://chardet.feedparser.org/
>>
>> It is a Python program and the encodings it detects are limited but it
>> might work for you.
>>
>> Given all the above, when I was faced with the problem you are facing
>> I found it easiest to make an educated guess as to the original
>> encoding and then do test restores with client_encoding set to my guess.
>
> Understood. We had figured the problem to be small, and it appears it is
> and thus felt we could address it a character at a time. Then get this
> error:
>
> pg_restore: [archiver (db)] Error from TOC entry 5258; 0 17549 TABLE
> DATA fax postgres
> pg_restore: [archiver (db)] COPY failed: ERROR: invalid byte sequence
> for encoding "UTF8": 0xe28053
>
> That hex value doesn't translate to a single character. I've dumped the
> data to a file as you suggested, but reviewing the identified line
> brings no joy.
>

The only thing I can think of is to use iconv like:

iconv -c -t utf8 -f utf8 -o converted_txt.txt 'original.txt'

where original.txt is your plain text data dump. The -c switch causes
iconv to silently drop any invalid characters from the output instead
of stopping at the first one.

You could then run a diff against converted_txt.txt and 'original.txt'
to see what characters in the original text are causing the problem.
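
If the diff output is hard to read, the same check can be sketched in
Python. This is only a rough illustration, not something from the
thread: the function names are made up for the example, and it assumes
the dump is a plain text file like original.txt above. It reports the
line number, byte offset and hex value of each sequence that does not
decode as UTF-8. For the error quoted above, note that 0xe2 0x80 is the
start of a three-byte UTF-8 sequence, but 0x53 is plain ASCII "S" rather
than a continuation byte, so the sequence is cut short.

```python
def find_invalid_utf8(data: bytes):
    """Yield (offset, bad_bytes) for each invalid UTF-8 sequence in data."""
    pos = 0
    while pos < len(data):
        try:
            # Strict decoding raises on the first invalid sequence.
            data[pos:].decode("utf-8")
            return
        except UnicodeDecodeError as exc:
            # exc.start/exc.end are relative to the slice we decoded.
            start = pos + exc.start
            end = pos + exc.end
            yield start, data[start:end]
            # Resume scanning just past the bad bytes.
            pos = end


def report_invalid_lines(path):
    """Print every invalid UTF-8 sequence found in the file at path."""
    # Open in binary mode so Python does not try to decode for us.
    with open(path, "rb") as fh:
        for lineno, line in enumerate(fh, start=1):
            for offset, bad in find_invalid_utf8(line):
                print(f"line {lineno}, byte {offset}: 0x{bad.hex()}")
```

Calling report_invalid_lines("original.txt") would then list the
offending spots without needing a second, converted copy of the dump.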

--
Adrian Klaver
adrian(dot)klaver(at)gmail(dot)com

In response to

Re: error while trying to change the database encoding on a database at 2011-01-24 18:57:15 from Geoffrey Myers
