From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Perform COPY FROM encoding conversions in larger chunks
Date: 2020-12-16 12:17:58
Message-ID: e7861509-3960-538a-9025-b75a61188e01@iki.fi

I've been looking at the COPY FROM parsing code, trying to refactor it
so that parallel COPY would be easier to implement. I haven't touched
parallelism itself; I've just been looking for ways to smooth the path,
and for ways to speed up COPY in general.

Currently, COPY FROM parses the input one line at a time. Each line is
converted to the database encoding separately, or if the file encoding
matches the database encoding, we just check that the input is valid for
the encoding. It would be more efficient to do the encoding
conversion/verification in larger chunks. At least potentially; the
current conversion/verification implementations work one byte at a time, so
it doesn't matter too much, but there are faster algorithms out there
that use SIMD instructions or lookup tables that benefit from larger inputs.

So I'd like to change it so that the encoding conversion/verification is
done before splitting the input into lines. The problem is that the
conversion and verification functions throw an error on incomplete
input. So we can't pass them a chunk of N raw bytes if we don't know
where the character boundaries are. The first step in this effort is to
change the conversion and verification routines to allow that. Attached
patches 0001-0004 do that:

For encoding conversions, change the signature of the conversion
functions by adding a "bool noError" argument and making them return the
number of input bytes successfully converted. That way, the conversion
function can be called in a streaming fashion: load a buffer with raw
input without caring about the character boundaries, call the conversion
function to convert it except for the few bytes at the end that might be
an incomplete character, load the buffer with more data, and repeat.
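
To illustrate the calling pattern (a rough sketch only, not the patch
code; convert_chunk() and read_raw_input() below are hypothetical
stand-ins, and the real function names, buffer handling and error
reporting differ), a caller could loop along these lines:

#include <stdbool.h>
#include <string.h>

#define RAW_BUF_SIZE 65536

/* hypothetical: convert src to dst, return number of src bytes consumed */
extern int convert_chunk(const unsigned char *src, int srclen,
                         unsigned char *dst, int dstlen, bool noError);
/* hypothetical: fetch more raw input, return number of bytes read, 0 at EOF */
extern int read_raw_input(unsigned char *buf, int len);

static unsigned char raw_buf[RAW_BUF_SIZE];
static unsigned char conv_buf[RAW_BUF_SIZE * 4];

static void
convert_in_chunks(void)
{
    int         nraw = 0;       /* bytes currently buffered in raw_buf */

    for (;;)
    {
        int         nread;
        int         nconv;

        nread = read_raw_input(raw_buf + nraw, RAW_BUF_SIZE - nraw);
        nraw += nread;
        if (nraw == 0)
            break;              /* all input converted, nothing left over */

        /*
         * With noError = true, the conversion stops just short of an
         * incomplete multibyte character at the end of the buffer instead
         * of throwing an error, and reports how many bytes it consumed.
         */
        nconv = convert_chunk(raw_buf, nraw, conv_buf, sizeof(conv_buf), true);

        if (nconv == 0 && nread == 0)
            return;             /* stuck: incomplete or invalid data that can
                                 * never be converted; report the error here */

        /* ... hand the converted data in conv_buf to the line splitter ... */

        /* keep the unconverted tail and refill the buffer behind it */
        memmove(raw_buf, raw_buf + nconv, nraw - nconv);
        nraw -= nconv;
    }
}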

For encoding verification, add a new function that works similarly. It
takes N bytes of raw input, verifies as much of it as possible, and
returns the number of input bytes that were valid. In principle, this
could've been implemented by calling the existing pg_encoding_mblen()
and pg_encoding_verifymb() functions in a loop, but it would be too
slow. This adds encoding-specific functions for that. The UTF-8
implementation is slightly optimized by basically inlining the
pg_utf8_mblen() call; the other implementations are pretty naive.
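
For illustration, a naive verifier along those lines for UTF-8 could
look roughly like this (a simplified sketch of the interface, not the
patch code; it only checks sequence lengths and continuation bytes,
whereas the real implementation also does the full legality checks, and
the function name is made up):

static int
utf8_verify_chunk(const unsigned char *s, int len)
{
    const unsigned char *start = s;
    const unsigned char *end = s + len;

    while (s < end)
    {
        int         l;
        int         i;

        if (*s < 0x80)
            l = 1;              /* plain ASCII byte */
        else if ((*s & 0xE0) == 0xC0)
            l = 2;
        else if ((*s & 0xF0) == 0xE0)
            l = 3;
        else if ((*s & 0xF8) == 0xF0)
            l = 4;
        else
            break;              /* invalid first byte */

        if (end - s < l)
            break;              /* incomplete character at end of chunk */

        /* all continuation bytes must look like 10xxxxxx */
        for (i = 1; i < l; i++)
        {
            if ((s[i] & 0xC0) != 0x80)
                break;
        }
        if (i < l)
            break;              /* invalid continuation byte */

        s += l;
    }

    /* number of bytes from the start that form complete, valid characters */
    return (int) (s - start);
}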

- Heikki

Attachment Content-Type Size
0001-Add-new-mbverifystr-function-for-each-encoding.patch text/x-patch 34.5 KB
0002-Replace-pg_utf8_verifystr-with-a-faster-implementati.patch text/x-patch 2.2 KB
0003-Add-direct-conversion-routines-between-EUC_TW-and-Bi.patch text/x-patch 5.5 KB
0004-Change-conversion-function-signature.patch text/x-patch 151.0 KB
0005-Do-COPY-FROM-encoding-conversion-verification-in-lar.patch text/x-patch 18.5 KB
