
Re: COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence

From: Steven Schlansker <steven(at)trumpet(dot)io>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence
Date: 2010-08-19 22:12:53
Message-ID: 928E965A-66A5-48C4-AC05-D308616AF0BD@trumpet.io
Lists: pgsql-bugs, pgsql-hackers
On Aug 19, 2010, at 2:35 PM, Tom Lane wrote:

> Steven Schlansker <steven(at)trumpet(dot)io> writes:
>> I'm having a rather annoying problem - a particular string is causing the Postgres COPY functionality to lose a byte, causing data corruption in backups and transferred data.
> 
> I was able to reproduce this on my own Mac.  Some tracing shows that the
> problem is that isspace(0x85) returns true when in locale en_US.utf-8.
> This causes array_in to drop the final byte of the array element string,
> thinking that it's insignificant whitespace.
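
To make the failure mode concrete, here is a minimal sketch of a byte-wise
trailing-whitespace trim. It is not the array_in source; the function names
and the two predicates are made up for illustration, and space_or_0x85()
only simulates an isspace() that reports 0x85 as whitespace.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical predicate: plain ASCII whitespace only. */
static int ascii_space(unsigned char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

/* Hypothetical predicate simulating an isspace() that also
 * treats byte 0x85 as whitespace, as described above. */
static int space_or_0x85(unsigned char c)
{
    return ascii_space(c) || c == 0x85;
}

/* Byte-wise trailing trim: returns the length of s after dropping
 * trailing bytes for which is_space() returns nonzero. */
static size_t trimmed_len(const char *s, int (*is_space)(unsigned char))
{
    size_t len = strlen(s);

    while (len > 0 && is_space((unsigned char) s[len - 1]))
        len--;
    return len;
}
```

With the bytes 0xCF 0x85 as input, the second predicate shortens the string
to one byte, leaving a truncated UTF-8 sequence behind, while the ASCII-only
predicate leaves both bytes alone.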

The 0x85 appears to be the second byte of a two-byte UTF-8 sequence.
I'm not at all experienced with character encodings, so I could be
totally off base, but isn't it wrong to call isspace(0x85) at all,
whatever the result, given that the actual character is encoded as the
bytes 0xCF 0x85 (U+03C5, GREEK SMALL LETTER UPSILON)?
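
As a sanity check on that byte arithmetic, here is a small sketch that decodes
a two-byte UTF-8 sequence by hand. The function names are my own, and it
deliberately handles only the two-byte form (it does not reject overlong
encodings), so treat it as an illustration rather than a real decoder.

```c
#include <stdint.h>

/* True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
static int is_utf8_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Decode a two-byte UTF-8 sequence (110xxxxx 10xxxxxx).
 * Returns the code point, or -1 if the bytes do not have that shape. */
static int32_t decode_utf8_2byte(unsigned char b0, unsigned char b1)
{
    if ((b0 & 0xE0) != 0xC0 || !is_utf8_continuation(b1))
        return -1;
    /* Five payload bits from the lead byte, six from the continuation. */
    return ((int32_t) (b0 & 0x1F) << 6) | (b1 & 0x3F);
}
```

decode_utf8_2byte(0xCF, 0x85) gives 0x3C5, and is_utf8_continuation(0x85) is
true, so 0x85 on its own is never a complete character in UTF-8.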


>  I believe that you must
> not have produced the data file data.copy on a Mac, or at least not in
> that locale setting, because array_out should have double-quoted the
> array element given that behavior of isspace().

Correct, it was produced on a Linux machine.  That said, the charset
there was also UTF-8.

> 
> Now, it's probably less than sane for isspace() to be behaving that way,
> since in a UTF8-based locale 0x85 can't be regarded as a valid character
> code at all.  But I'm not hopeful about the results of filing a bug with
> Apple, because their UTF8-based locales have a lot of other bu^H^Hdubious
> behaviors too, which they appear not to care much about.

I actually can't reproduce that behavior here:

#include <ctype.h>
#include <stdio.h>

/* Print whether byte 0x85 is classified as whitespace by isspace(). */
int main(void) {
    printf("%d\n", isspace(0x85));
    return 0;
}

[steven(at)xxx:~]% gcc -o test test.c
[steven(at)xxx:~]% ./test
0
[steven(at)xxx:~]% locale
LANG="en_US.utf-8"
LC_COLLATE="en_US.utf-8"
LC_CTYPE="en_US.utf-8"
LC_MESSAGES="en_US.utf-8"
LC_MONETARY="en_US.utf-8"
LC_NUMERIC="en_US.utf-8"
LC_TIME="en_US.utf-8"
LC_ALL=
[steven(at)xxx:~]% uname -a
Darwin xxx.local 10.4.0 Darwin Kernel Version 10.4.0: Fri Apr 23 18:28:53 PDT 2010; root:xnu-1504.7.4~1/RELEASE_I386 i386 i386
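
One caveat about the test above: it never calls setlocale(), and a C program
starts in the "C" locale no matter what LANG says, so this run may not be
exercising en_US.utf-8 at all. A hedged sketch of a helper that selects the
locale first (the helper name is hypothetical, and I have not verified what
it returns for 0x85 on this machine):

```c
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Set LC_CTYPE to locale_name ("" means "take it from the environment")
 * and report how isspace() classifies the given byte value.
 * Returns 1 or 0, or -1 if the requested locale is not available. */
static int isspace_in_locale(const char *locale_name, int byte)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;
    return isspace(byte) != 0;
}
```

Calling isspace_in_locale("", 0x85) uses the LC_CTYPE shown by the locale
command, while isspace_in_locale("C", 0x85) matches what the original test
program measured.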


Thanks much for your help,
Steven Schlansker

