From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | John Naylor <john(dot)naylor(at)enterprisedb(dot)com> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Perform COPY FROM encoding conversions in larger chunks |
Date: | 2021-02-01 16:15:13 |
Message-ID: | 11d39e63-b80a-5f8d-8043-fff04201fadc@iki.fi |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 30/01/2021 20:47, John Naylor wrote:
> I took a look at v2, and for the first encoding I tried, it fails to
> report the error for invalid input:
That's embarassing...
> I believe the problem is in UtfToLocal(). I've attached a fix formatted
> as a text file to avoid confusing the cfbot. The fix keeps the debugging
> ereport() in case you find it useful.
Thanks. I fixed it slightly differently, and also changed LocalToUtf()
to follow the same pattern, even though LocalToUtf() did not have the
same bug.
> Some additional test coverage might be good here, but not sure how
> much work that would be. I didn't check any other conversions yet.
I added a bunch of tests for various built-in conversions.
> v2-0002 seems fine to me, I just have cosmetic comments here:
>
> + * the same, no conversion is required by we must still validate that the
>
> s/by/but/
>
> This comment in copyfrom_internal.h above the *StateData struct is the
> same as the corresponding one in copyto.c:
>
> * Multi-byte encodings: all supported client-side encodings encode
> multi-byte
> * characters by having the first byte's high bit set. Subsequent bytes
> of the
> * character can have the high bit not set. When scanning data in such an
> * encoding to look for a match to a single-byte (ie ASCII) character,
> we must
> * use the full pg_encoding_mblen() machinery to skip over multibyte
> * characters, else we might find a false match to a trailing byte. In
> * supported server encodings, there is no possibility of a false
> match, and
> * it's faster to make useless comparisons to trailing bytes than it is to
> * invoke pg_encoding_mblen() to skip over them. encoding_embeds_ascii
> is true
> * when we have to do it the hard way.
>
> The references to pg_encoding_mblen() and encoding_embeds_ascii, are out
> of date for copy-from. I'm not sure the rest is relevant to copy-from
> anymore, either. Can you confirm?
Yeah, that comment is obsolete for COPY FROM, the encoding conversion
works differently now. Removed it from copyfrom_internal.h.
> This comment inside the struct is now out of date as well:
>
> * Similarly, line_buf holds the whole input line being processed. The
> * input cycle is first to read the whole line into line_buf, convert it
> * to server encoding there, and then extract the individual attribute
>
> HEAD has this macro already:
>
> /* Shorthand for number of unconsumed bytes available in raw_buf */
> #define RAW_BUF_BYTES(cstate) ((cstate)->raw_buf_len -
> (cstate)->raw_buf_index)
>
> It might make sense to create a CONVERSION_BUF_BYTES equivalent since
> the patch calculates cstate->conversion_buf_len -
> cstate->conversion_buf_index in a couple places.
Thanks for the review!
I spent some time refactoring and adding comments all around the patch,
hopefully making it all more clear. One notable difference is that I
renamed 'raw_buf' (which exists in master too) to 'input_buf', and
renamed 'conversion_buf' to 'raw_buf'. I'm going to read through this
patch again another day with fresh eyes, and also try to add some tests
for the corner cases at buffer boundaries.
Attached is a new set of patches. I added some regression tests for the
built-in conversion functions, which cover the bug you found, and many
other interesting cases that did not have test coverage yet. It comes in
two patches: the first patch uses just the existing convert_from() SQL
function, and the second one uses the new "noError" variants of the
conversion functions. I also kept the bug-fixes compared to the previous
patch version as a separate commit, for easier review.
- Heikki
Attachment | Content-Type | Size |
---|---|---|
v3-0001-Add-regression-tests-for-built-in-encoding-conver.patch | text/x-patch | 53.3 KB |
v3-0002-Change-conversion-function-signature.patch | text/x-patch | 155.7 KB |
v3-0003-Add-tests-for-the-new-noError-variants-of-built-i.patch | text/x-patch | 93.8 KB |
v3-0004-Fix-bugs-in-the-commit-to-change-conversion-funct.patch | text/x-patch | 4.1 KB |
v3-0005-Do-COPY-FROM-encoding-conversion-verification-in-.patch | text/x-patch | 42.1 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Stephen Frost | 2021-02-01 16:49:05 | Re: Proposal: Save user's original authenticated identity for logging |
Previous Message | Justin Pryzby | 2021-02-01 15:47:14 | Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly |