Please don't top post.
Rural Hunter wrote:
>>> My db is in utf-8, I have a row in my table say tmp_article and I wanted
>>> to generate ts_vector from the article content:
>>> select to_tsvector(content) from tmp_article;
>>> But I got this error:
>>> ERROR: invalid byte sequence for encoding "UTF8": 0xf481
>>> I am wondering how this could happen. I think if there were invalid UTF-8
>>> bytes in the content, it shouldn't have been possible to insert it into
>>> the tmp_article table in the first place, as I sometimes see similar
>>> errors when inserting records into tmp_article. Am I right?
>> You are right in theory. A lot depends on your PostgreSQL version:
>> the efforts to prevent invalid strings from entering the database have
>> led to changes over the versions, and older versions are more permissive.
>> To test the theory that the contents of the table are bad, you can
>> test if the same happens if you
>> SELECT convert_to(content, 'UTF8') FROM tmp_article;
> =>SELECT convert_to(content, 'UTF8') FROM tmp_article;
> This works. My PostgreSQL is the latest 9.1.3 on an Ubuntu 10.04 server.
> We have millions of rows in the database, but this is the only one where
> we hit the problem. The bad data was inserted in recent days, and we
> upgraded to 9.1.3 right after it was released.
Hmm, that is strange.
Could you find the row that causes the problem and post the
result of "CAST (content AS bytea)" for this row?
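If the offending row is not easy to locate, a loop along these lines might
narrow it down (just a sketch; it assumes tmp_article has a primary key
column named "id"):

```sql
-- Try to_tsvector on each row individually and report the ids that fail.
DO $$
DECLARE
    r record;
BEGIN
    FOR r IN SELECT id, content FROM tmp_article LOOP
        BEGIN
            PERFORM to_tsvector(r.content);
        EXCEPTION WHEN OTHERS THEN
            RAISE NOTICE 'to_tsvector fails for id = %', r.id;
        END;
    END LOOP;
END $$;
```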
Do you use any nonstandard text search parsers or dictionaries?
What is the text search configuration you use (parameter
default_text_search_config)?
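For what it's worth, the byte pair from the error message can be fed to the
encoding check directly. 0xF4 opens a four-byte UTF-8 sequence, so 0xF4 0x81
on its own is truncated and invalid:

```sql
-- Reproduces the same class of error independently of the table data:
SELECT convert_from('\xf481'::bytea, 'UTF8');
-- ERROR:  invalid byte sequence for encoding "UTF8": 0xf481
```

Since convert_to() on the column succeeds, this points less at the stored
data and more at whatever the text search machinery produces from it, which
is why the parser/dictionary question above matters.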