Re: Perform COPY FROM encoding conversions in larger chunks

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Perform COPY FROM encoding conversions in larger chunks
Date: 2020-12-17 21:44:22
Message-ID: b9e3167f-f84b-7aa4-5738-be578a4db924@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

One of the patches in this patch set is worth calling out separately:
0003-Add-direct-conversion-routines-between-EUC_TW-and-Bi.patch. Per
commit message:

Add direct conversion routines between EUC_TW and Big5.

Conversions between EUC_TW and Big5 were previously implemented by
converting the whole input to MIC first, and then from MIC to the
target encoding. Implement functions to convert directly between the
two.

The reason to do this now is that the next patch will change the
change the conversion function signature so that if the input is
invalid, we convert as much as we can and return the number of bytes
successfully converted. That's not possible if we use an intermediary
format, because if an error happens in the intermediary -> final
conversion, we lose track of the location of the invalid character in
the original input. Avoiding the intermediate step should be faster
too.

This patch is fairly independent of the others. It could be reviewed and
applied separately.

In order to verify that the new code is correct, I wrote some helper
plpgsql functions to generate all valid EUC_TW and Big5 byte sequences
that encode one character, and tested converting each of them. Then I
compared the the results with unpatched server, to check that the new
code performs the same conversion. This is perhaps overkill, but since
its pretty straightforward to enumerate all the input characters, might
as well do it.

For the sake of completeness, I wrote similar helpers for all the other
encodings and conversions. Except for UTF-8, there are too many formally
valid codepoints for that to feasible. This does test round-trip
conversions of all codepoints from all the other encodings to UTF-8 and
back, though, so there's pretty good coverage of UTF-8 too.

This test suite is probably too large to add to the source tree, but for
the sake of the archives, I'm attaching it here. The first patch adds
the test suite, including the expected output of each conversion. The
second patch contains expected output changes for the above patch to add
direct conversions between EUC_TW and Big5. It affected the error
messages for some byte sequences that cannot be converted. For example,
on unpatched master:

postgres=# select convert('\xfdcc', 'euc_tw', 'big5');
ERROR: character with byte sequence 0x95 0xfd 0xcc in encoding
"MULE_INTERNAL" has no equivalent in encoding "BIG5"

With the patch:

postgres=# select convert('\xfdcc', 'euc_tw', 'big5');
ERROR: character with byte sequence 0xfd 0xcc in encoding "EUC_TW" has
no equivalent in encoding "BIG5"

The old message talked about "MULE_INTERNAL" which exposes the
implementation detail that we used it as an intermediate in the
conversion. That can be confusing to a user, the new message makes more
sense. So that's also nice.

- Heikki

Attachment Content-Type Size
0001-Add-conversion-test-suite.patch.bz2 application/x-bzip 2.8 MB
0002-Expected-output-changes-for-patch-to-convert-directl.patch.bz2 application/x-bzip 412.4 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2020-12-17 21:47:04 Re: [HACKERS] [PATCH] Generic type subscripting
Previous Message Andrew Dunstan 2020-12-17 21:37:54 multi-install PostgresNode