Re: Perform COPY FROM encoding conversions in larger chunks

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Perform COPY FROM encoding conversions in larger chunks
Date: 2021-01-30 18:47:06
Message-ID: CAFBsxsFp8QCnjdUdj+ZeXR3PzmTn2Xn=WTXUBCnf6=ovyn4Q+Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jan 28, 2021 at 7:36 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
> Even more surprising was that the second patch
> (0002-Replace-pg_utf8_verifystr-with-a-faster-implementati.patch)
> actually made things worse again. I thought it would give a modest gain,
> but nope.

Hmm, that surprised me too.

> Based on these results, I'm going to commit the first patch, but not the
> second one. There are much faster UTF-8 verification routines out there,
> using SIMD instructions and whatnot, and we should consider adopting one
> of those, but that's future work.

I have something in mind for that.

I took a look at v2, and for the first encoding I tried, it fails to report
the error for invalid input:

create database euctest WITH ENCODING 'EUC_CN' LC_COLLATE='zh_CN.eucCN'
LC_CTYPE='zh_CN.eucCN' TEMPLATE=template0;

\c euctest
create table foo (a text);

master:

euctest=# copy foo from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> ä
>> \.
ERROR: character with byte sequence 0xc3 0xa4 in encoding "UTF8" has no
equivalent in encoding "EUC_CN"
CONTEXT: COPY foo, line 1

patch:

euctest=# copy foo from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> ä
>> \.
COPY 0
euctest=#

I believe the problem is in UtfToLocal(). I've attached a fix formatted as
a text file to avoid confusing the cfbot. The fix keeps the debugging
ereport() in case you find it useful. Some additional test coverage might
be good here, but not sure how much work that would be. I didn't check any
other conversions yet.

v2-0002 seems fine to me, I just have cosmetic comments here:

+ * the same, no conversion is required by we must still validate that the

s/by/but/

This comment in copyfrom_internal.h above the *StateData struct is the same
as the corresponding one in copyto.c:

* Multi-byte encodings: all supported client-side encodings encode
multi-byte
* characters by having the first byte's high bit set. Subsequent bytes of
the
* character can have the high bit not set. When scanning data in such an
* encoding to look for a match to a single-byte (ie ASCII) character, we
must
* use the full pg_encoding_mblen() machinery to skip over multibyte
* characters, else we might find a false match to a trailing byte. In
* supported server encodings, there is no possibility of a false match, and
* it's faster to make useless comparisons to trailing bytes than it is to
* invoke pg_encoding_mblen() to skip over them. encoding_embeds_ascii is
true
* when we have to do it the hard way.

The references to pg_encoding_mblen() and encoding_embeds_ascii, are out of
date for copy-from. I'm not sure the rest is relevant to copy-from anymore,
either. Can you confirm?

This comment inside the struct is now out of date as well:

* Similarly, line_buf holds the whole input line being processed. The
* input cycle is first to read the whole line into line_buf, convert it
* to server encoding there, and then extract the individual attribute

HEAD has this macro already:

/* Shorthand for number of unconsumed bytes available in raw_buf */
#define RAW_BUF_BYTES(cstate) ((cstate)->raw_buf_len -
(cstate)->raw_buf_index)

It might make sense to create a CONVERSION_BUF_BYTES equivalent since the
patch calculates cstate->conversion_buf_len - cstate->conversion_buf_index
in a couple places.

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment Content-Type Size
fix-utf8-convertedbytes-calc.txt text/plain 1.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2021-01-30 19:06:39 Re: Add primary keys to system catalogs
Previous Message Michail Nikolaev 2021-01-30 17:11:16 Re: Thoughts on "killed tuples" index hint bits support on standby