Re: BUG #5532: Valid UTF8 sequence errors as invalid

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Michael Lewis" <mikelikespie(at)gmail(dot)com>
Cc:	pgsql-bugs(at)postgresql(dot)org
Subject:	Re: BUG #5532: Valid UTF8 sequence errors as invalid
Date:	2010-06-30 16:44:45
Message-ID:	12210.1277916285@sss.pgh.pa.us
Lists:	pgsql-bugs

"Michael Lewis" <mikelikespie(at)gmail(dot)com> writes:
> I'm using Python to sanitize my logs from invalid UTF8 characters before
> COPYing them into postgres. I came across this one sequence that seems to
> be valid UTF8 (in the extended range I believe).

It is not valid. See http://tools.ietf.org/html/rfc3629 --- a sequence
beginning with ED must have a second byte in the range 80-9F to be
legal, and this doesn't. The example you give would decode as U+DF2D,
i.e. half of a surrogate pair, which is specifically disallowed in UTF8
--- you're supposed to code the original character directly, not via a
surrogate pair. The primary reason for this rule is that otherwise
there are multiple ways to encode the same character, which can be a
security hazard.
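
For illustration, a minimal Python sketch of the check described above. The
bytes ED BC AD are an assumption: they were picked because they decode to
U+DF2D, and the bytes from the original report are not quoted in this message.
The helper names are hypothetical, and the ED rule shown is only one of the
constraints in RFC 3629, not a full validator.

```python
def decode_three_byte(seq: bytes) -> int:
    """Arithmetically unpack a 3-byte UTF-8 pattern, with no validity checks."""
    b0, b1, b2 = seq
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def ed_second_byte_ok(seq: bytes) -> bool:
    """RFC 3629: after a lead byte of ED, the second byte must be 80-9F."""
    return seq[0] != 0xED or 0x80 <= seq[1] <= 0x9F


def is_strict_utf8(data: bytes) -> bool:
    """Ask Python 3's strict decoder whether it accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Assumed sample bytes (not taken from the original bug report).
sample = bytes([0xED, 0xBC, 0xAD])

code_point = decode_three_byte(sample)
is_surrogate = 0xD800 <= code_point <= 0xDFFF

print(f"U+{code_point:04X}", is_surrogate)   # code point and surrogate flag
print(ed_second_byte_ok(sample))             # second byte BC is outside 80-9F
print(is_strict_utf8(sample))                # Python 3 strict decoding
```

With these bytes the second byte is BC, which falls outside 80-9F, so the
sequence fails the ED rule, and the unpacked value lands in the surrogate
range D800-DFFF. The Python 3 strict decoder rejects it as well; older
decoders may behave differently.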

> It goes through both Python's encoding as well as iconv without an error.

You should file bugs against those tools.

regards, tom lane

In response to

BUG #5532: Valid UTF8 sequence errors as invalid at 2010-06-30 08:42:25 from Michael Lewis

Responses

Re: BUG #5532: Valid UTF8 sequence errors as invalid at 2010-06-30 18:05:24 from Mike Lewis
