From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
---|---|
To: | Sutou Kouhei <kou(at)clear-code(dot)com> |
Cc: | michael(at)paquier(dot)xyz, david(dot)g(dot)johnston(at)gmail(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, zhjwpku(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Make COPY format extendable: Extract COPY TO format implementations |
Date: | 2025-07-28 19:33:28 |
Message-ID: | CAD21AoB01+FZ_bOR4uOq_x_qBE8on1ASdJWm+UexxOwC1RGD-A@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Fri, Jul 18, 2025 at 3:05 AM Sutou Kouhei <kou(at)clear-code(dot)com> wrote:
>
> Hi,
>
> In <CAD21AoAZL2RzPM4RLOJKm_73z5LXq2_VOVF+S+T0tnbjHdWTFA(at)mail(dot)gmail(dot)com>
> "Re: Make COPY format extendable: Extract COPY TO format implementations" on Thu, 17 Jul 2025 13:44:11 -0700,
> Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> >> > How about adding accessors instead of splitting
> >> > Copy{From,To}State to Copy{From,To}ExecutionData? If we use
> >> > the accessors approach, we can export only needed
> >> > information step by step without breaking ABI.
> >
> > Yeah, while it can export required fields without breaking ABI, I'm
> > concerned that setter and getter functions could be bloated if we need
> > to have them for many fields.
>
> In general, I choose this approach in my projects even when
> I need to define many accessors. Because I can hide
> implementation details from users. I can change
> implementation details without breaking API/ABI.
>
> But PostgreSQL isn't my project. Is there any guideline for
> PostgreSQL API(/ABI?) design that we can refer for this
> case?
As far as I know there is no official guideline for PostgreSQL API and
ABI design, but I've never seen structs having more than 10 getter and
setter in PostgreSQL source code.
>
> FYI: We need to export at least the following fields:
>
> https://www.postgresql.org/message-id/flat/20250714.173803.865595983884510428.kou%40clear-code.com#78fdbccf89742f856aa2cf95eaf42032
>
> > FROM:
> >
> > - attnumlist (*)
> > - bytes_processed
> > - cur_attname
> > - escontext
> > - in_functions (*)
> > - input_buf
> > - input_reached_eof
> > - line_buf
> > - opts (*)
> > - raw_buf
> > - raw_buf_index
> > - raw_buf_len
> > - rel (*)
> > - typioparams (*)
> >
> > TO:
> >
> > - attnumlist (*)
> > - fe_msgbuf
> > - opts (*)
I think we can think of the minimum list of fields that we need to
expose. For instance, fields used for buffered reads for COPY FROM
such as input_buf and raw_buf related fields don't necessarily need to
be exposed as extension can implement it in its own way. We can start
with the supporting simple copy format extensions like that read and
parse the binary data from the data source and fill 'values' and
'nulls' arrays as output. Considering these facts, it might be
sufficient for copy format extensions if they could access 'rel',
'attnumlist', and 'opts' in both COPY FROM and COPY TO (and
CopyFromErrorCallback related fields for COPY FROM).
Apart from this, we might want to reorganize CopyFromStateData fields
and CopyToStateData fields since they have mixed fields of general
purpose fields for COPY operations (e.g., num_defaults, whereClause,
and range_table) and built-in format specific fields (e.g., line_buf
and input_buf). Text and CSV formats are using some fields for parsing
fields with buffered reads so one idea is that we move related fields
to another struct so that both built-in formats (text and CSV) and
external extensions that want to use the buffered reads for text
parsing can use this functionality.
> Here are pros/cons of the Copy{From,To}ExecutionData
> approach, right?
>
> Pros:
> 1. We can hide internal data from extensions
>
> Cons:
> 1. Built-in format routines need to refer fields via
> Copy{From,To}ExecutionData.
> * This MAY has performance impact. If there is no
> performance impact, this is not a cons.
> 2. API/ABI compatibility will be broken when we change
> exported fields.
> * I'm not sure whether this is a cons in the PostgreSQL
> design.
>
> Here are pros/cons of the accessors approach:
>
> Pros:
> 1. We can hide internal data from extensions
> 2. We can export new fields change field names
> without breaking API/ABI compatibility
> 3. We don't need to change built-in format routines.
> So we can assume that there is no performance impact.
>
> Cons:
> 1. We may need to define many accessors
> * I'm not sure whether this is a cons in the PostgreSQL
> design.
I agree with the summary.
> >> Another idea: We'll add Copy{From,To}State::opaque
> >> eventually. (For example, the v40-0003 patch includes it.)
> >>
> >> How about using it to hide fields only for built-in formats?
> >
> > What is the difference between your idea and splitting CopyToState
> > into CopyToState and CopyToExecutionData?
>
> 1. We don't need to manage 2 similar data for built-in
> formats and extensions.
> * Build-in formats use CopyToExecutionData and extensions
> use opaque.
> 2. We can introduce registration API now.
> * We can work on this topic AFTER we introduce
> registration API.
> * e.g.: Add registration API -> Add opaque -> Use opaque
> for internal fields (we will benchmark this
> implementation at this time)
What if we find performance overhead in built-in format cases after
introducing opaque data? I personally would like to avoid merging the
registration API (i.e., supporting custom copy formats) while being
unsure about the overall design ahead and the potential performance
impact by following patches.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
From | Date | Subject | |
---|---|---|---|
Next Message | Tomas Vondra | 2025-07-28 19:44:11 | Re: should we have a fast-path planning for OLTP starjoins? |
Previous Message | Dave Cramer | 2025-07-28 19:15:23 | Re: More protocol.h replacements this time into walsender.c |