Re: [SPAM]-D] How to find broken UTF-8 characters ?

From: Justin Graf <justin(at)magwerks(dot)com>
To: Andreas <maps(dot)on(at)gmx(dot)net>, pgsql-sql(at)postgresql(dot)org
Subject: Re: [SPAM]-D] How to find broken UTF-8 characters ?
Date: 2010-04-29 13:05:23
Message-ID: 4BD98413.1010109@magwerks.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

On 4/28/2010 10:34 PM, Andreas wrote:
> Hi,
>
> while writing the reply below I found it sounds like beeing OT but
> it's actually not.
> I just need a way to check if a collumn contains values that CAN NOT
> be converted from Utf8 to Latin1.
> I tried:
> Select convert_to (my_column::text, 'LATIN1') from my_table;
>
> It raises an error that says translated:
> ERROR: character 0xe28093 in encoding »UTF8« has no equivalent in
> »LATIN1«
>
> I'd like to select all those records.
> When I know which record has faulty content I can correct it.
>
> If this is really OT on the SQL list then please tell me where to ask.

That's easy enough you need to write an Update statement using regular
expression to replace all non legal Latin/ASCII char
http://www.postgresql.org/docs/8.4/interactive/functions-string.html
<http://www.postgresql.org/docs/8.4/interactive/functions-string.html>

the command is regexp_replace('MyBadString',
'SearchForallNoneAsccIIChars', 'ReplaceWithBlankString ')

I'm pretty sure this is the regualr expression to find all non ASCII
chars.. [^\x00-\xFF]

To test is try to Select regexp_replace( MyColumn, '[^\x00-\xFF]', ' ')
from screweduptable

If the regular expression does not work, I'm dry well, when it comes to
regular expressions. Dd i say i hate regular expression. It dam near
impossible to write.
Once you get the expression right and working

the Update is straight forward.
Update mytable set mybadcolumn = regexp_replace( mybadcolumn,
'[^\x00-\xFF]', ' ')

>> Select covert('MyUtf8', 'UTF8', 'LATIN')
>> or
>> Select covert_to('MyUtf8', 'LATIN')
>
> I found them before but didn't understand their output.
> e.g.
> Select convert('1aäßx', 'utf8', 'LATIN1') ;
> Result = "1a\344\337x"
> so it translated ä = 344 and ß = 337. The other 3 are just as they
> were before.
> How can this be valid in a single byte charset like Latin1?
> Especially as ä, ß are E4 and DF.
> Why do they come out as escaped codes when they are in Latin1 aswell
> as 1, a and x?

Someone with more knowledge how convert() works is going to have to
explain why they have been escaped. PgAdmin may have escaped them. But
those characters are valid Latin1 characters

http://en.wikipedia.org/wiki/%C3%84
http://en.wikipedia.org/wiki/%C3%9F
ß = latin Beta

It seems Access and Excel are putting in extra bits of data into the
field. In the past i had to change inserts/updates from Access so it
would send data in a specific char encoding. I had problems where
Access was using a Windows Encoding, the ODBC converted it to Latin1,
and the MsSQL Database put it in UTF8. It was no fun cleaning it up.

>
>> What ever pg client library used to move Excel data to PG my have
>> incorrectly converted some of the data or moved formatting
>> information into the database. I have seen Access and Excel do
>> mightily odd things when connecting to DB's I don't know about
>> current versions but 2000 and 2003 Excels did really stupid things
>> when trying to write to DB's including MSSQL.
>
> Cute ... we use Access 2000 and 2003 :(

Been more Screwed by Excel and Access flakiness, and Access Programmers
thinking they are DBA's.

All legitimate Magwerks Corporation quotations are sent in a .PDF file attachment with a unique ID number generated by our proprietary quotation system. Quotations received via any other form of communication will not be honored.

CONFIDENTIALITY NOTICE: This e-mail, including attachments, may contain legally privileged, confidential or other information proprietary to Magwerks Corporation and is intended solely for the use of the individual to whom it addresses. If the reader of this e-mail is not the intended recipient or authorized agent, the reader is hereby notified that any unauthorized viewing, dissemination, distribution or copying of this e-mail is strictly prohibited. If you have received this e-mail in error, please notify the sender by replying to this message and destroy all occurrences of this e-mail immediately.
Thank you.

In response to

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message sandeep prakash dhumale 2010-04-29 13:13:40 Tsearch not searching 'Y'
Previous Message Edward Ross 2010-04-29 07:59:09 Re: problem converting strings to timestamps with time zone