Re: COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Steven Schlansker <steven(at)trumpet(dot)io>
Cc: pgsql-bugs(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence
Date: 2010-08-19 21:35:01
Message-ID: 28097.1282253701@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

Steven Schlansker <steven(at)trumpet(dot)io> writes:
> I'm having a rather annoying problem - a particular string is causing the Postgres COPY functionality to lose a byte, causing data corruption in backups and transferred data.

I was able to reproduce this on my own Mac. Some tracing shows that the
problem is that isspace(0x85) returns true when in locale en_US.utf-8.
This causes array_in to drop the final byte of the array element string,
thinking that it's insignificant whitespace. I believe that you must
not have produced the data file data.copy on a Mac, or at least not in
that locale setting, because array_out should have double-quoted the
array element given that behavior of isspace().

Now, it's probably less than sane for isspace() to be behaving that way,
since in a UTF8-based locale 0x85 can't be regarded as a valid character
code at all. But I'm not hopeful about the results of filing a bug with
Apple, because their UTF8-based locales have a lot of other bu^H^Hdubious
behaviors too, which they appear not to care much about.

In any case, stepping back and taking a larger viewpoint, it seems unsafe
for array_in/array_out to have any locale-sensitive behavior anyhow.
As an example, in a LATINx locale it is entirely sane for isspace() to
return true for 0xA0, while it should certainly not do so in C locale.
This means we are at risk of data corruption, ie dropping a valid data
character, when an array value starting or ending with 0xA0 is dumped
from a C-locale database and loaded into a LATINx-locale one.

So it seems like the safest answer is to modify array_in/array_out to
use an ASCII-only definition of isspace(). I believe this is
traditionally defined as space, tab, CR, LF, VT, FF. We could perhaps
trim that further, like just space and tab, but there might be some risk
of breaking client code that expects the other traditional whitespace to
be ignored.

I'm not sure if there are any other places with similar risks.
hstore's I/O routines contain isspace calls, but I haven't
analyzed the implications. There is an isspace call in record_out
but it is just there for cosmetic purposes and doesn't protect
any decisions in record_in, so I think it's okay if it makes
platform/locale-dependent choices.

Comments?

regards, tom lane

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Steven Schlansker 2010-08-19 22:12:53 Re: COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence
Previous Message Martin Pitt 2010-08-19 21:11:12 Re: libpq: system-wide root.crt

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Davis 2010-08-19 21:37:49 Re: Old git repo
Previous Message Christopher Browne 2010-08-19 21:33:44 Re: Old git repo