Re: Make COPY format extendable: Extract COPY TO format implementations

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Sutou Kouhei <kou(at)clear-code(dot)com>, sawada(dot)mshk(at)gmail(dot)com, zhjwpku(at)gmail(dot)com, andrew(at)dunslane(dot)net, nathandbossart(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Make COPY format extendable: Extract COPY TO format implementations
Date: 2024-02-06 02:41:06
Message-ID: ZcGcQq3Y6FeX1z4e@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Feb 05, 2024 at 05:41:25PM -0800, Andres Freund wrote:
> On 2024-02-06 10:01:36 +0900, Michael Paquier wrote:
>> If you have concerns about that, I'm OK to revert, I'm not wedded to
>> this level of control. Note that I've actually seen *better*
>> runtimes.
>
> I'm somewhat worried that handling the different formats at that level will
> make it harder to improve copy performance - it's quite attrociously slow
> right now. The more we reduce the per-row/field overhead, the more the
> dispatch overhead will matter.

Yep. That's the hard part when it comes to design these callbacks.
We don't want something too high level because this leads to more code
duplication churns when someone wants to plug in its own routine set,
and we don't want to be at a too low level because of the indirect
calls as you said. I'd like to think that the current CopyFromOneRow
offers a good balance here, avoiding the "if" branch with the binary
and non-binary paths.

>> Hmm. Do you have concerns about v13 posted on [2] then?
>
> As is I'm indeed not a fan. It imo doesn't make sense to have an indirect
> dispatch for *both* ->copy_attribute_out *and* ->CopyToOneRow. After all, when
> in ->CopyToOneRow for text, we could know that we need to call
> CopyAttributeOutText etc.

Right.

>> If yes, then I'd assume that this shuts down the whole thread or that it
>> needs a completely different approach, because we will multiply indirect
>> function calls that can control how data is generated for each row, which is
>> the original case that Sutou-san wanted to tackle.
>
> I think it could be rescued fairly easily - remove the dispatch via
> ->copy_attribute_out(). To avoid duplicating code you could use a static
> inline function that's used with constant arguments by both csv and text mode.

Hmm. So you basically mean to tweak the beginning of
CopyToTextOneRow() and CopyToTextStart() so as copy_attribute_out is
saved in a local variable outside of cstate and we'd save the "if"
checked for each attribute. If I got that right, it would mean
something like the v13-0002 attached, on top of the v13-0001 of
upthread. Is that what you meant?

> I think it might also be worth ensuring that future patches can move branches
> like
> if (cstate->encoding_embeds_ascii)
> if (cstate->need_transcoding)
> into the choice of per-row callback.

Yeah, I'm still not sure how much we should split CopyToStateData in
the initial patch set. I'd like to think that the best result would
be to have in the state data an opaque (void *) that points to a
structure that can be set for each format, so as there is a clean
split between which variable gets set and used where (same remark
applies to COPY FROM with its raw_fields, raw_fields, for example).
--
Michael

Attachment Content-Type Size
v13-0001-Extract-COPY-FROM-TO-format-implementations.patch text/x-diff 36.3 KB
v13-0002-Optimize-copy_attribute_out.patch text/x-diff 3.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2024-02-06 02:57:58 Re: [PoC] Improve dead tuple storage for lazy vacuum
Previous Message Andres Freund 2024-02-06 02:05:04 confusing / inefficient "need_transcoding" handling in copy