Re: Re: Perform COPY FROM encoding conversions in larger chunks

From: Chapman Flack <chap(at)anastigmatix(dot)net>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: Perform COPY FROM encoding conversions in larger chunks
Date: 2021-05-01 20:06:19
Message-ID: 608DB4BB.8080204@anastigmatix.net
Lists: pgsql-hackers

On 04/01/21 05:27, Heikki Linnakangas wrote:
> I read through the patches one more time, made a few small comment fixes,
> and pushed.

Wow, this whole thread escaped my attention at the time, though my ears
would have perked right up if the subject had been something like
'improve encoding conversion API to stream a buffer at a time'. I think
this is of great interest beyond one particular use case in COPY FROM.
For example, it could limit the allocations needed when streaming a large
text value out to a client; it might be used to advantage with the recent
work on incrementally detoasting large values; and so on.

This part seems a little underdeveloped:

> * TODO: The conversion function interface is not great. Firstly, it
> * would be nice to pass through the destination buffer size to the
> * conversion function, so that if you pass a shorter destination buffer, it
> * could still continue to fill up the whole buffer. Currently, we have to
> * assume worst case expansion and stop the conversion short, even if there
> * is in fact space left in the destination buffer. Secondly, it would be
> * nice to return the number of bytes written to the caller, to avoid a call
> * to strlen().

If I understand correctly, this patch already makes a breaking change to
the conversion function API. If that's going to be the case anyway, I wonder
whether it's worth going further and restructuring the API to eliminate this
odd limitation entirely.

There seems to be a sort of common shape that conversion APIs have evolved
toward, that can be seen in both the ICU4C converters [0] and in Java's [1].
This current tweak to our conversion API seems to get allllmmoooosst there,
but just not quite. For example, noError lets the caller keep control when
the function has stopped converting, but the caller doesn't find out which
of those reasons it stopped for.

If we just went the rest of the way and structured the API like those
existing ones, then:

- it would be super easy to write wrappers around ICU4C converters, if
there were any we wanted to use;

- I could very easily write wrappers presenting any PG-supported charset
as a Java charset.

The essence of the API common to ICU4C and Java is this (a rough C sketch
follows the numbered points below):

1. You pass the function the address and length of a source buffer,
the address and length of a destination buffer, and a flag that is true
if you know there is no further input where this source buffer came from.
(It's allowable to pass false and only then discover you really have no
more input after all; then you just make one final call passing true.)

2. The function eats as much as it can of the source buffer, fills as much
as it can of the destination buffer, and returns indicating one of four
reasons it stopped:

underflow - ran out of source buffer
overflow - ran out of destination buffer
malformed - something in the source buffer isn't valid in that representation
unmappable - a valid codepoint not available in destination encoding

Based on that result, the caller refills the source buffer, or drains the
destination buffer, or handles or reports the malformed or unmappable
sequence, then repeats.

3. The function should update pointers on return to indicate how much
of the source buffer it consumed and how much of the destination buffer
it filled.

4. If it left any bytes unconsumed in the source buffer, the caller must
preserve them (perhaps moved to the front) for the next call.

5. The converter can have internal state (hence it is an object in Java,
and ICU4C allocates a UConverter struct to give it a place for that
state). The state gets flushed on the final call where the flag is
passed true. In many cases, the converter can be implemented without
keeping internal state, if it simply leaves, for example, an
incomplete sequence at the end of the source buffer unconsumed, so the
caller will move it to the front and supply the rest. On the other hand,
any unconsumed input after the final call with flush flag true must be
treated as malformed.

6. On a malformed or unmappable return, the source buffer is left pointed
at the start of the offending sequence and the length in bytes of
that sequence is available for error reporting/recovery.
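
To make that concrete, here is a rough sketch (mine, not anything from the
patch) of what such an entry point might look like in C. Every name in it
(pg_conv_result, pg_conv_state, pg_convert_stream) is invented purely for
illustration; nothing of the sort exists in the tree today, and the real
thing would no doubt be shaped by the fmgr considerations below.

#include <stdbool.h>

/* The four reasons a streaming converter can stop (point 2). */
typedef enum pg_conv_result
{
    PG_CONV_UNDERFLOW,          /* ran out of source bytes */
    PG_CONV_OVERFLOW,           /* ran out of destination space */
    PG_CONV_MALFORMED,          /* invalid sequence in the source */
    PG_CONV_UNMAPPABLE          /* valid codepoint, no mapping in target */
} pg_conv_result;

/* Opaque per-conversion state, for converters that need any (point 5). */
typedef struct pg_conv_state pg_conv_state;

/*
 * Convert as much as possible from [*src, src_end) into [*dst, dst_end).
 * On return, *src and *dst have been advanced past whatever was consumed
 * and produced (point 3).  If the result is MALFORMED or UNMAPPABLE, *src
 * is left pointing at the offending sequence and *bad_len is set to its
 * length in bytes (point 6).  'flush' is the "no further input" flag
 * (points 1 and 5).
 */
extern pg_conv_result pg_convert_stream(pg_conv_state *state,
                                        const char **src,
                                        const char *src_end,
                                        char **dst,
                                        char *dst_end,
                                        bool flush,
                                        int *bad_len);

The only point here is that the stop reason, the advanced pointers, and the
flush flag are all visible at the call site; how the state object gets
created and passed around is a separate question.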

The efficient handling of states, returning updated pointers, and so on,
probably requires a function signature with 'internal' in it ... but
the current function signature already has 'internal', so that doesn't
seem like a deal-breaker.
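
For completeness, the caller's side of such an interface (the refill/drain
loop described after point 2) would look roughly like this, building on the
declarations sketched above. Again this is only a sketch of mine;
refill_source(), drain_destination() and report_bad_sequence() are invented
placeholders for whatever the real caller (COPY FROM, the client output
path, ...) would actually do:

static void
convert_all(pg_conv_state *state, char *dst_buf, size_t dst_buf_size)
{
    const char *src;
    const char *src_end;
    char       *dst = dst_buf;
    char       *dst_end = dst_buf + dst_buf_size;
    bool        no_more_input = false;

    /* prime the source buffer */
    refill_source(&src, &src_end, &no_more_input);

    for (;;)
    {
        int             bad_len;
        pg_conv_result  r;

        r = pg_convert_stream(state, &src, src_end,
                              &dst, dst_end, no_more_input, &bad_len);

        if (r == PG_CONV_OVERFLOW)
        {
            /* destination full: drain it and continue */
            drain_destination(dst_buf, dst - dst_buf);
            dst = dst_buf;
        }
        else if (r == PG_CONV_UNDERFLOW)
        {
            if (no_more_input)
                break;          /* final (flush) call already made */
            /* preserve any unconsumed tail and fetch more input (point 4) */
            refill_source(&src, &src_end, &no_more_input);
        }
        else
        {
            /* MALFORMED or UNMAPPABLE: report (or substitute) and skip it */
            report_bad_sequence(src, bad_len);
            src += bad_len;
        }
    }

    /* push out whatever is left in the destination buffer */
    drain_destination(dst_buf, dst - dst_buf);
}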

Thoughts? It seems a shame to make a breaking change in the conversion
API, only to end up with an API that "is not great" and remains
impedance-mismatched to other prominent existing conversion APIs.

Regards,
-Chap

[0]
https://unicode-org.github.io/icu/userguide/conversion/converters.html#3-buffered-or-streamed

[1]
https://docs.oracle.com/javase/9/docs/api/java/nio/charset/CharsetDecoder.html#decode-java.nio.ByteBuffer-java.nio.CharBuffer-boolean-
