Re: COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>
Cc:	steven(at)trumpet(dot)io, pgsql-bugs(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence
Date:	2010-08-20 19:47:03
Message-ID:	25780.1282333623@sss.pgh.pa.us
Lists:	pgsql-bugs pgsql-hackers

Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp> writes:
>> We generally assume that in server-safe encodings, the ctype.h functions
>> will behave sanely on any single-byte value.

> I think this "wisdom" is only true for the C locale. I'm not at all
> surprised that it does not work with non-C locales.

> From array_funcs.c:

>     while (isspace((unsigned char) *p))
>         p++;

> IMO this should be something like:

>     while (isspace((unsigned char) *p))
>         p += pg_mblen(p);

I don't think that's likely to help at all. The risk is that isspace
will do something not-sane with a fragment of a character. If it's not
coded to guard against that, it's just as likely to give wrong results
for the leading byte as for non-leading bytes. (In the case at hand,
I think the underlying problem is that it imagines what it's given is
a Unicode code point, not a byte of a UTF8 string. There apparently
aren't any code points in the range U+00C0 - U+00FF for which isspace
is true, but that's not true for isalpha for example.)
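To make the byte-versus-code-point confusion concrete: in UTF-8, U+00E9 (é) is stored as the two bytes 0xC3 0xA9. A ctype function that reads the byte 0xC3 as if it were code point U+00C3 (Ã) would report it as alphabetic, even though it is only a fragment of a character. The following is a minimal sketch, not PostgreSQL code; `Utf8ByteKind`, `utf8_byte_kind` and `utf8_decode2` are hypothetical names introduced only for illustration.

```c
#include <stdint.h>

/* Hypothetical classification of a single byte of a UTF-8 string. */
typedef enum
{
    U8_ASCII,           /* 0x00..0x7F: a complete one-byte character */
    U8_CONTINUATION,    /* 0x80..0xBF: a non-leading byte */
    U8_LEAD,            /* 0xC2..0xF4: first byte of a multibyte character */
    U8_INVALID          /* 0xC0, 0xC1, 0xF5..0xFF: never valid in UTF-8 */
} Utf8ByteKind;

static Utf8ByteKind
utf8_byte_kind(unsigned char b)
{
    if (b < 0x80)
        return U8_ASCII;
    if (b < 0xC0)
        return U8_CONTINUATION;
    if (b >= 0xC2 && b <= 0xF4)
        return U8_LEAD;
    return U8_INVALID;
}

/*
 * Decode a two-byte UTF-8 sequence (lead byte 0xC2..0xDF followed by one
 * continuation byte) into its code point. No validation is done here.
 */
static uint32_t
utf8_decode2(unsigned char b0, unsigned char b1)
{
    return ((uint32_t) (b0 & 0x1F) << 6) | (uint32_t) (b1 & 0x3F);
}
```

Here `utf8_decode2(0xC3, 0xA9)` yields 0xE9, the code point actually stored, while the leading byte on its own has the numeric value 0xC3, which is a different character when misread as a code point.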

If we were going to try to code around this, we'd need to change all
these loops to look something like

    while ((isascii((unsigned char) *p) ||
            pg_database_encoding_max_length() == 1) &&
           isspace((unsigned char) *p))
        p += pg_mblen(p);    // or p++, it wouldn't matter
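The guarded loop can be tried outside the backend with a small standalone sketch. Everything here is an assumption made for illustration: `skip_space_guarded` is a hypothetical wrapper, the `encoding_max_length` parameter stands in for the result of `pg_database_encoding_max_length()`, and `is_ascii_byte` stands in for `isascii()`, which is POSIX rather than ISO C.

```c
#include <ctype.h>

/* Portable stand-in for isascii(): true for bytes 0x00..0x7F. */
static int
is_ascii_byte(unsigned char c)
{
    return c < 0x80;
}

/*
 * Skip leading whitespace, handing a byte to isspace() only when it is
 * ASCII or the encoding is single-byte. encoding_max_length is a stand-in
 * for what pg_database_encoding_max_length() would return.
 */
static const char *
skip_space_guarded(const char *p, int encoding_max_length)
{
    while ((is_ascii_byte((unsigned char) *p) ||
            encoding_max_length == 1) &&
           isspace((unsigned char) *p))
        p++;    /* any byte accepted here is a whole one-byte character */
    return p;
}
```

Advancing with `p++` is enough in this sketch for the reason given in the comment on the loop: a byte reaches `isspace` only if it is ASCII or the encoding is single-byte, and in both cases the character is one byte long. In a multibyte encoding, non-ASCII bytes never reach `isspace` at all, so a platform that misclassifies them cannot affect the result.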

However, given the limited number of platforms where this is an issue
and the fact that it is an acknowledged bug on those platforms,
I'm not eager to go there.

In any case, no matter whether we changed that or not, we'd still have
the problem that it's a bad idea to have any locale-dependent behavior
in array_in; and the behavior *would* still be locale-dependent, at
least in single-byte encodings.
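One way to take the locale out of the picture entirely is to test for an explicit set of ASCII whitespace characters instead of calling `isspace`. The sketch below only illustrates that idea and is not necessarily the change made in PostgreSQL; `ascii_isspace` and `skip_ascii_space` are hypothetical names.

```c
#include <stdbool.h>

/*
 * Hypothetical locale-independent whitespace test: true only for the six
 * characters that isspace() accepts in the C locale.
 */
static bool
ascii_isspace(char ch)
{
    return ch == ' ' || ch == '\t' || ch == '\n' ||
           ch == '\r' || ch == '\v' || ch == '\f';
}

/* Skip leading ASCII whitespace; bytes >= 0x80 are never skipped. */
static const char *
skip_ascii_space(const char *p)
{
    while (ascii_isspace(*p))
        p++;
    return p;
}
```

Because no ctype.h function is consulted, the result is the same under every locale and encoding, and a byte that belongs to a multibyte character can never be mistaken for whitespace.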

regards, tom lane

In response to

Re: COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence at 2010-08-19 23:29:57 from Tatsuo Ishii
