Re: Make COPY format extendable: Extract COPY TO format implementations

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Sutou Kouhei <kou(at)clear-code(dot)com>
Cc: andrew(at)dunslane(dot)net, sawada(dot)mshk(at)gmail(dot)com, zhjwpku(at)gmail(dot)com, nathandbossart(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Make COPY format extendable: Extract COPY TO format implementations
Date: 2024-01-25 23:35:19
Message-ID: ZbLwNyOKbddno0Ue@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jan 25, 2024 at 05:45:43PM +0900, Sutou Kouhei wrote:
> In <ZbHS439y-Bs6HIAR(at)paquier(dot)xyz>
> "Re: Make COPY format extendable: Extract COPY TO format implementations" on Thu, 25 Jan 2024 12:17:55 +0900,
> Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>> +extern CopyToRoutine CopyToRoutineText;
>> +extern CopyToRoutine CopyToRoutineCSV;
>> +extern CopyToRoutine CopyToRoutineBinary;
>>
>> All that should IMO remain in copyto.c and copyfrom.c in the initial
>> patch doing the refactoring. Why not using a fetch function instead
>> that uses a string in input? Then you can call that once after
>> parsing the List of options in ProcessCopyOptions().
>
> OK. How about the following for the fetch function
> signature?
>
> extern CopyToRoutine *GetBuiltinCopyToRoutine(const char *format);

Or CopyToRoutineGet()? I am not wedded to my suggestion, got a bad
history with naming things around here.

> We may introduce an enum and use it:
>
> typedef enum CopyBuiltinFormat
> {
> COPY_BUILTIN_FORMAT_TEXT = 0,
> COPY_BUILTIN_FORMAT_CSV,
> COPY_BUILTIN_FORMAT_BINARY,
> } CopyBuiltinFormat;
>
> extern CopyToRoutine *GetBuiltinCopyToRoutine(CopyBuiltinFormat format);

I am not sure that this is necessary as the option value is a string.

> Oh, sorry. I assumed that the comment style was adjusted by
> pgindent.

No worries, that's just something we get used to. I tend to fix a lot
of these things by myself when editing patches.

>> + getTypeBinaryOutputInfo(attr->atttypid, &out_func_oid, &isvarlena);
>> + fmgr_info(out_func_oid, &cstate->out_functions[attnum - 1]);
>>
>> Actually, this split is interesting. It is possible for a custom
>> format to plug in a custom set of out functions. Did you make use of
>> something custom for your own stuff?
>
> I didn't. My PoC custom COPY format handler for Apache Arrow
> just handles integer and text for now. It doesn't use
> cstate->out_functions because cstate->out_functions may not
> return a valid binary format value for Apache Arrow. So it
> formats each value by itself.

I mean, if you use a custom output function, you could tweak things
even more with byteas or such.. If a callback is expected to do
something, like setting the output function OIDs in the start
callback, we'd better document it rather than letting that be implied.

>> Actually, could it make sense to
>> split the assignment of cstate->out_functions into its own callback?
>
> Yes. Because we need to use getTypeBinaryOutputInfo() for
> "binary" and use getTypeOutputInfo() for "text" and "csv".

Okay. After sleeping on it, a split makes sense here, because it also
reduces the presence of TupleDesc in the start callback.

>> Sure, that's part of the start phase, but at least it would make clear
>> that a custom method *has* to assign these OIDs to work. The patch
>> implies that as a rule, without a comment that CopyToStart *must* set
>> up these OIDs.
>
> CopyToStart doesn't need to set up them if the handler
> doesn't use cstate->out_functions.

Noted.

>> I think that 0001 and 0005 should be handled first, as pieces
>> independent of the rest. Then we could move on with 0002~0004 and
>> 0006~0008.
>
> OK. I'll focus on 0001 and 0005 for now. I'll restart
> 0002-0004/0006-0008 after 0001 and 0005 are accepted.

Once you get these, I'd be interested in re-doing an evaluation of
COPY TO and more tests with COPY FROM while running Postgres on
scissors. One thing I was thinking to use here is my blackhole_am for
COPY FROM:
https://github.com/michaelpq/pg_plugins/tree/main/blackhole_am

As per its name, it does nothing on INSERT, so you could create a
table using it as access method, and stress the COPY FROM execution
paths without having to mount Postgres on a tmpfs because the data is
sent to the void. Perhaps it does not matter, but that moves the
tests to the bottlenecks we want to stress (aka the per-row callback
for large data sets).

I've switched the patch as waiting on author for now. Thanks for your
perseverance here. I understand that's not easy to follow up with
patches and reviews (^_^;)
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Euler Taveira 2024-01-26 00:28:45 Re: speed up a logical replica setup
Previous Message Euler Taveira 2024-01-25 23:34:46 Re: speed up a logical replica setup