Quick Links

Re: Make COPY format extendable: Extract COPY TO format implementations

From:	Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To:	Sutou Kouhei <kou(at)clear-code(dot)com>
Cc:	michael(at)paquier(dot)xyz, david(dot)g(dot)johnston(at)gmail(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, zhjwpku(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Make COPY format extendable: Extract COPY TO format implementations
Date:	2025-07-29 05:19:36
Message-ID:	CAD21AoBa0Wm3C2H12jaqkvLidP2zEhsC+gf=3w7XiA4LQnvx0g@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Mon, Jul 28, 2025 at 12:33 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Fri, Jul 18, 2025 at 3:05 AM Sutou Kouhei <kou(at)clear-code(dot)com> wrote:
> >
> > Hi,
> >
> > In <CAD21AoAZL2RzPM4RLOJKm_73z5LXq2_VOVF+S+T0tnbjHdWTFA(at)mail(dot)gmail(dot)com>
> > "Re: Make COPY format extendable: Extract COPY TO format implementations" on Thu, 17 Jul 2025 13:44:11 -0700,
> > Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > >> > How about adding accessors instead of splitting
> > >> > Copy{From,To}State to Copy{From,To}ExecutionData? If we use
> > >> > the accessors approach, we can export only needed
> > >> > information step by step without breaking ABI.
> > >
> > > Yeah, while it can export required fields without breaking ABI, I'm
> > > concerned that setter and getter functions could be bloated if we need
> > > to have them for many fields.
> >
> > In general, I choose this approach in my projects even when
> > I need to define many accessors. Because I can hide
> > implementation details from users. I can change
> > implementation details without breaking API/ABI.
> >
> > But PostgreSQL isn't my project. Is there any guideline for
> > PostgreSQL API(/ABI?) design that we can refer for this
> > case?
>
> As far as I know there is no official guideline for PostgreSQL API and
> ABI design, but I've never seen structs having more than 10 getter and
> setter in PostgreSQL source code.
>
> >
> > FYI: We need to export at least the following fields:
> >
> > https://www.postgresql.org/message-id/flat/20250714.173803.865595983884510428.kou%40clear-code.com#78fdbccf89742f856aa2cf95eaf42032
> >
> > > FROM:
> > >
> > > - attnumlist (*)
> > > - bytes_processed
> > > - cur_attname
> > > - escontext
> > > - in_functions (*)
> > > - input_buf
> > > - input_reached_eof
> > > - line_buf
> > > - opts (*)
> > > - raw_buf
> > > - raw_buf_index
> > > - raw_buf_len
> > > - rel (*)
> > > - typioparams (*)
> > >
> > > TO:
> > >
> > > - attnumlist (*)
> > > - fe_msgbuf
> > > - opts (*)
>
> I think we can think of the minimum list of fields that we need to
> expose. For instance, fields used for buffered reads for COPY FROM
> such as input_buf and raw_buf related fields don't necessarily need to
> be exposed as extension can implement it in its own way. We can start
> with the supporting simple copy format extensions like that read and
> parse the binary data from the data source and fill 'values' and
> 'nulls' arrays as output. Considering these facts, it might be
> sufficient for copy format extensions if they could access 'rel',
> 'attnumlist', and 'opts' in both COPY FROM and COPY TO (and
> CopyFromErrorCallback related fields for COPY FROM).
>
> Apart from this, we might want to reorganize CopyFromStateData fields
> and CopyToStateData fields since they have mixed fields of general
> purpose fields for COPY operations (e.g., num_defaults, whereClause,
> and range_table) and built-in format specific fields (e.g., line_buf
> and input_buf). Text and CSV formats are using some fields for parsing
> fields with buffered reads so one idea is that we move related fields
> to another struct so that both built-in formats (text and CSV) and
> external extensions that want to use the buffered reads for text
> parsing can use this functionality.

So probably it might be worth refactoring the codes in terms of:

1. hiding internal data from format callbacks
2. separating format-specific fields from the main state data.

I categorized the fields in CopyFromStateData. I think there are
roughly three different kind of fields mixed there:

1. fields used only the core (not by format callback)
- filename
- is_program
- whereClause
- cur_relname
- copycontext
- defmap
- num_defaults
- volatile_defexprs
- range_table
- rtrperminfos
- qualexpr
- transition_capture

2. fields used by both the core and format callbacks
- rel
- attnumlist
- cur_lineno
- cur_attname
- cur_attval
- relname_only
- num_errors
- opts
- in_functions
- typioparams
- escontext
- defexprs
- Input-related fields
- copy_src
- copy_file
- fe_msgbuf
- data_source_cb
- byteprocessed

3. built-in format specific fields (mostly for text and csv)
- eol_type
- defaults
- Encoding related fields
- file_encoding
- need_transcoding
- conversion_proc
- convert_select_flags
- raw data pointers
- max_fields
- raw_fields
- attribute_buf
- line_buf related fields
- line_buf
- line_buf_valid
- input_buf related fields
- input_buf
- input_buf_index
- input_buf_len
- input_reached_eof
- input_reached_error
- raw_buf related fields
- raw_buf
- raw_buf_index
- raw_buf_len
- raw_reached_eof

The fields in 1 are mostly static fields, and the fields in 2 and 3
are likely to be accessed in hot functions during COPY FROM. Would it
be a good idea to restructure these fields so that we can hide the
fields in 1 from callback functions and having the fields in 3 in a
separate format-specific struct that can be accessed via an opaque
pointer? But could the latter change potentially cause performance
overheads?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

In response to

Re: Make COPY format extendable: Extract COPY TO format implementations at 2025-07-28 19:33:28 from Masahiko Sawada

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Zhijie Hou (Fujitsu)	2025-07-29 05:20:51	RE: Conflict detection for update_deleted in logical replication
Previous Message	Amit Kapila	2025-07-29 05:13:32	Re: [Patch] add new parameter to pg_replication_origin_session_setup