
Perform COPY FROM encoding conversions in larger chunks

From:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Perform COPY FROM encoding conversions in larger chunks
Date:	2020-12-16 12:17:58
Message-ID:	e7861509-3960-538a-9025-b75a61188e01@iki.fi
Lists:	pgsql-hackers

I've been looking at the COPY FROM parsing code, trying to refactor it
so that parallel COPY would be easier to implement. I haven't touched
parallelism itself; I'm just looking for ways to smooth the way, and
for ways to speed up COPY in general.

Currently, COPY FROM parses the input one line at a time. Each line is
converted to the database encoding separately, or if the file encoding
matches the database encoding, we just check that the input is valid for
the encoding. It would be more efficient to do the encoding
conversion/verification in larger chunks. At least potentially: the
current conversion/verification implementations work one byte at a
time, so it doesn't matter too much, but there are faster algorithms out
there that use SIMD instructions or lookup tables and benefit from
larger inputs.

So I'd like to change it so that the encoding conversion/verification is
done before splitting the input into lines. The problem is that the
conversion and verification functions throw an error on incomplete
input, so we can't pass them a chunk of N raw bytes if we don't know
where the character boundaries are. The first step in this effort is to
change the encoding verification and conversion routines to allow that.
Attached patches 0001-0004 do that:

For encoding conversions, change the signature of the conversion
function, by adding a "bool noError" argument and making them return the
number of input bytes successfully converted. That way, the conversion
function can be called in a streaming fashion: load a buffer with raw
input without caring about the character boundaries, call the conversion
function to convert it except for the few bytes at the end that might be
an incomplete character, load the buffer with more data, and repeat.
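
To illustrate the streaming calling pattern described above, here is a
minimal, self-contained sketch. It is not code from the patches:
toy_utf8_to_latin1() and stream_convert() are hypothetical names, the toy
converter only stands in for a conversion function that returns the
number of input bytes it consumed, and the 4-byte buffer is deliberately
tiny so that characters get split across loads.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RAW_BUF_SIZE 4			/* deliberately tiny, to force split characters */

/*
 * Hypothetical stand-in for a conversion function: converts UTF-8 to
 * LATIN1, stops at the first sequence it cannot convert (incomplete,
 * invalid or unrepresentable), and returns the number of input bytes
 * consumed. The number of output bytes is stored in *dstlen.
 */
static size_t
toy_utf8_to_latin1(const unsigned char *src, size_t len,
				   unsigned char *dst, size_t *dstlen)
{
	size_t		i = 0;
	size_t		o = 0;

	while (i < len)
	{
		if (src[i] < 0x80)
			dst[o++] = src[i++];
		else if (src[i] == 0xC2 || src[i] == 0xC3)
		{
			if (i + 1 >= len || (src[i + 1] & 0xC0) != 0x80)
				break;			/* incomplete or invalid: stop here */
			dst[o++] = (unsigned char) (((src[i] & 0x03) << 6) |
										(src[i + 1] & 0x3F));
			i += 2;
		}
		else
			break;				/* invalid, or not representable in LATIN1 */
	}
	*dstlen = o;
	return i;
}

/*
 * Convert 'input' in RAW_BUF_SIZE-sized loads, carrying any unconverted
 * tail over to the next load. Returns the output length, or -1 if the
 * input is invalid or ends in the middle of a character.
 */
static long
stream_convert(const unsigned char *input, size_t input_len,
			   unsigned char *out)
{
	unsigned char raw[RAW_BUF_SIZE];
	size_t		raw_len = 0;
	size_t		in_pos = 0;
	size_t		out_len = 0;

	for (;;)
	{
		size_t		space = RAW_BUF_SIZE - raw_len;
		size_t		n = input_len - in_pos;
		size_t		consumed;
		size_t		produced;

		/* Top up the buffer without caring about character boundaries. */
		if (n > space)
			n = space;
		memcpy(raw + raw_len, input + in_pos, n);
		raw_len += n;
		in_pos += n;
		assert(raw_len <= RAW_BUF_SIZE);

		if (raw_len == 0)
			return (long) out_len;	/* everything converted */

		consumed = toy_utf8_to_latin1(raw, raw_len, out + out_len, &produced);
		out_len += produced;

		/*
		 * The buffer is either full or holds all remaining input here, so
		 * making no progress means a real error, not just a split character.
		 */
		if (consumed == 0)
			return -1;

		/* Keep the unconverted tail for the next round. */
		memmove(raw, raw + consumed, raw_len - consumed);
		raw_len -= consumed;
	}
}
```

The caller tells a split character from a genuine error only by whether
more input can still be loaded, which is why the function has to report
progress instead of throwing on the first incomplete sequence.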

For encoding verification, add a new function that works similarly. It
takes N bytes of raw input, verifies as much of it as possible, and
returns the number of input bytes that were valid. In principle, this
could've been implemented by calling the existing pg_encoding_mblen()
and pg_encoding_verifymb() functions in a loop, but it would be too
slow. This adds encoding-specific functions for that. The UTF-8
implementation is slightly optimized by basically inlining the
pg_utf8_mblen() call; the other implementations are pretty naive.
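
As a rough sketch of what "verify as much as possible and return the
number of valid bytes" can look like for UTF-8, with the length lookup
done inline instead of through a separate function call: the name
toy_utf8_verifystr() is hypothetical, this is not the code in patch
0001 or 0002, and the exact validity rules the real functions apply may
differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical illustration: return the length of the longest prefix of
 * s[0..len) that consists of complete, well-formed UTF-8 characters.
 * An incomplete character at the end is simply not counted.
 */
static size_t
toy_utf8_verifystr(const unsigned char *s, size_t len)
{
	size_t		i = 0;

	assert(s != NULL || len == 0);

	while (i < len)
	{
		unsigned char c = s[i];
		size_t		l;
		size_t		k;
		bool		ok = true;

		/* Fast path for ASCII. */
		if (c < 0x80)
		{
			i++;
			continue;
		}

		/* Sequence length from the lead byte, computed inline. */
		if ((c & 0xE0) == 0xC0)
			l = 2;
		else if ((c & 0xF0) == 0xE0)
			l = 3;
		else if ((c & 0xF8) == 0xF0)
			l = 4;
		else
			break;				/* stray continuation byte or invalid lead */

		if (len - i < l)
			break;				/* incomplete character at end of input */

		/* All trailing bytes must be continuation bytes (10xxxxxx). */
		for (k = 1; k < l; k++)
		{
			if ((s[i + k] & 0xC0) != 0x80)
				ok = false;
		}
		if (!ok)
			break;

		/* Reject overlong forms, surrogates and values above U+10FFFF. */
		if (l == 2 && c < 0xC2)
			break;
		if (l == 3 && c == 0xE0 && s[i + 1] < 0xA0)
			break;
		if (l == 3 && c == 0xED && s[i + 1] > 0x9F)
			break;
		if (l == 4 && c == 0xF0 && s[i + 1] < 0x90)
			break;
		if (l == 4 && (c > 0xF4 || (c == 0xF4 && s[i + 1] > 0x8F)))
			break;

		i += l;
	}
	return i;
}
```

A caller working on raw chunks would treat a short return value as
"keep the remaining bytes for the next chunk" while more input is
available, and as an error only at end of input or when no progress is
made on a full buffer.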

- Heikki

Attachment	Content-Type	Size
0001-Add-new-mbverifystr-function-for-each-encoding.patch	text/x-patch	34.5 KB
0002-Replace-pg_utf8_verifystr-with-a-faster-implementati.patch	text/x-patch	2.2 KB
0003-Add-direct-conversion-routines-between-EUC_TW-and-Bi.patch	text/x-patch	5.5 KB
0004-Change-conversion-function-signature.patch	text/x-patch	151.0 KB
0005-Do-COPY-FROM-encoding-conversion-verification-in-lar.patch	text/x-patch	18.5 KB

Responses

Re: Perform COPY FROM encoding conversions in larger chunks at 2020-12-17 18:04:14 from Bruce Momjian
Re: Perform COPY FROM encoding conversions in larger chunks at 2020-12-17 21:44:22 from Heikki Linnakangas
Re: Perform COPY FROM encoding conversions in larger chunks at 2020-12-22 20:01:48 from John Naylor
