Re: Differences in UTF8 between 8.0 and 8.1

From: Andrew - Supernews <andrew+nonews(at)supernews(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Differences in UTF8 between 8.0 and 8.1
Date: 2005-10-24 05:07:40
Message-ID: slrndlor0s.g61.andrew+nonews@trinity.supernews.net
Lists: pgsql-hackers

On 2005-10-24, Paul Lindner <lindner(at)inuus(dot)com> wrote:
> Here's a cut and paste from emacs hexl-mode:
>
> 00000000: 3530 3833 6335 3038 330a 3c20 5641 4c55  5083c5083.< VALU
> 00000010: 4553 2028 3230 3235 3533 2c20 27c1 f9d4  ES (202553, '...
> 00000020: c2d0 c7d2 b927 2c20 0a2d 2d2d 0a3e 2056  .....', .---.> V
> 00000030: 414c 5545 5320 2832 3032 3535 332c 2027  ALUES (202553, '
> 00000040: d2b9 272c 200a 3136 3939 3432 6331 3639  ..', .169942c169
> 00000050: 3934 320a 3c20 5641 4c55 4553 2028 3833  942.< VALUES (83
> 00000060: 3031 352c 2027 b7ed a8c6 a448 272c 200a  015, '.....H', .
> 00000070: 2d2d 2d0a 3e20 5641 4c55 4553 2028 3833  ---.> VALUES (83
> 00000080: 3031 352c 2027 c6a4 4827 2c20 0a         015, '..H', .
>
> This is a minimal diff between a UTF8 scrubbed file and the
> original dump.
>
> It appears the offending bytes are:
>
>   C1 F9 C2 D0 C7

I'm inclined to suspect that the whole sequence c1 f9 d4 c2 d0 c7 d2 b9
was never actually a valid utf-8 string, and that the d2 b9 is only valid
by coincidence (it decodes to U+04B9, a Cyrillic letter used in
Azerbaijani).  I know the 8.0 utf-8 check was broken, but I didn't realize
it was quite so bad.
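
The reasoning above can be checked mechanically. What follows is a minimal
sketch, assuming Python 3.9+ and its built-in strict UTF-8 decoder; the
scrub() helper is hypothetical and is not the validation code from 8.0 or
8.1. It walks the bytes, keeps every run that decodes as exactly one
character, and drops everything else one byte at a time.

```python
def scrub(data: bytes) -> tuple[bytes, bytes]:
    """Split data into (kept, dropped) using strict UTF-8 decoding.

    Hypothetical helper for illustration only.
    """
    kept = bytearray()
    dropped = bytearray()
    i = 0
    while i < len(data):
        # Try the shortest prefix first, so a success is exactly one character.
        for n in (1, 2, 3, 4):
            chunk = data[i:i + n]
            try:
                chunk.decode("utf-8")  # strict by default
            except UnicodeDecodeError:
                continue
            kept += chunk
            i += len(chunk)
            break
        else:
            # No prefix of 1..4 bytes is valid: discard this byte and move on.
            dropped.append(data[i])
            i += 1
    return bytes(kept), bytes(dropped)


# The two byte strings from the hex dump (the second includes the trailing 'H').
first = bytes.fromhex("c1 f9 d4 c2 d0 c7 d2 b9")
second = bytes.fromhex("b7 ed a8 c6 a4 48")

for sample in (first, second):
    kept, dropped = scrub(sample)
    print("kept:", kept.hex(" "), "| dropped:", dropped.hex(" "))
# kept: d2 b9 | dropped: c1 f9 d4 c2 d0 c7
# kept: c6 a4 48 | dropped: b7 ed a8
```

The kept bytes match the '>' lines of the diff: only d2 b9 and c6 a4 (plus
the ASCII H) survive a strict check.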

> and
>
>   B7 ED A8

Likewise, that whole sequence b7 ed a8 c6 a4 was probably never valid;
c6 a4 (U+01A4) also isn't a character you'd expect to find in common use.

My guess is that this was data in some non-utf-8 charset that managed to
get past the defective checks in 8.0.

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
