From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Steven Schlansker <steven(at)trumpet(dot)io> |
Cc: | pgsql-bugs(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence |
Date: | 2010-08-19 22:24:41 |
Message-ID: | 28944.1282256681@sss.pgh.pa.us |
Lists: | pgsql-bugs pgsql-hackers |
Steven Schlansker <steven(at)trumpet(dot)io> writes:
> On Aug 19, 2010, at 2:35 PM, Tom Lane wrote:
>> I was able to reproduce this on my own Mac. Some tracing shows that the
>> problem is that isspace(0x85) returns true when in locale en_US.utf-8.
>> This causes array_in to drop the final byte of the array element string,
>> thinking that it's insignificant whitespace.
> The 0x85 seems to be the second byte of a multibyte UTF-8
> sequence.
Check.
> I'm not at all experienced with character encodings so I could
> be totally off base, but isn't it wrong to ever call isspace(0x85),
> whatever the result may be, given that the actual character is 0xCF85?
> (U+03C5, GREEK SMALL LETTER UPSILON)
We generally assume that in server-safe encodings, the ctype.h functions
will behave sanely on any single-byte value. You can argue the wisdom
of that, but deciding to change that policy would be a rather massive
code change; I'm not excited about going that direction.
>> I believe that you must
>> not have produced the data file data.copy on a Mac, or at least not in
>> that locale setting, because array_out should have double-quoted the
>> array element given that behavior of isspace().
> Correct, it was produced on a Linux machine. That said, the charset
> there was also UTF-8.
Right ... but you had an isspace function that meets our expectations.
> I actually can't reproduce that behavior here:
You need a setlocale() call, else the program acts as though it's in C
locale regardless of environment. My test case looks like this:
$ cat isspace.c
#include <stdio.h>
#include <ctype.h>
#include <locale.h>

int main()
{
    int c;

    setlocale(LC_ALL, "");
    for (c = 1; c < 256; c++)
    {
        if (isspace(c))
            printf("%3o is space\n", c);
    }
    return 0;
}
$ gcc -O -Wall isspace.c
$ LANG=C ./a.out
11 is space
12 is space
13 is space
14 is space
15 is space
40 is space
$ LANG=en_US.utf-8 ./a.out
11 is space
12 is space
13 is space
14 is space
15 is space
40 is space
205 is space
240 is space
$
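To make the failure mode concrete, here is a minimal sketch of how a byte-wise trailing-whitespace trim eats the last byte of 0xCF 0x85 once 0x85 (octal 205) is classed as a space. This is not the actual array_in code. The names trimmed_len, ascii_isspace and mac_utf8_isspace are hypothetical, and mac_utf8_isspace is only a hard-coded stand-in for the isspace() results shown above, so the outcome does not depend on the locale of whoever runs it.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical predicate type: decides whether one byte counts as whitespace. */
typedef int (*space_pred) (unsigned char);

/* ASCII-only test: the six bytes reported under LANG=C (011-015 and 040). */
int ascii_isspace(unsigned char c)
{
    return c == ' ' || (c >= '\t' && c <= '\r');
}

/*
 * Stand-in for the en_US.utf-8 result shown above (an assumption, hard-coded
 * for illustration): additionally accepts 0x85 (octal 205) and 0xA0 (octal 240).
 */
int mac_utf8_isspace(unsigned char c)
{
    return ascii_isspace(c) || c == 0x85 || c == 0xA0;
}

/* Length of s after dropping trailing bytes that pred classifies as whitespace. */
size_t trimmed_len(const char *s, space_pred pred)
{
    size_t len = strlen(s);

    while (len > 0 && pred((unsigned char) s[len - 1]))
        len--;
    return len;
}
```

With the ASCII-only predicate, trimmed_len("\xCF\x85", ascii_isspace) keeps both bytes of the upsilon. With the stand-in predicate it returns 1, leaving a lone 0xCF behind, which matches the single lost byte reported in this thread.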
regards, tom lane