| From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
|---|---|
| To: | Sutou Kouhei <kou(at)clear-code(dot)com>, andres(at)anarazel(dot)de |
| Cc: | michael(at)paquier(dot)xyz, david(dot)g(dot)johnston(at)gmail(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, zhjwpku(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org |
| Subject: | Re: Make COPY format extendable: Extract COPY TO format implementations |
| Date: | 2025-11-14 20:19:47 |
| Message-ID: | CAD21AoDCEfe0PQhMEx8G1rpS7RrzGCJPobeqm3Mpn2bgbUH9nQ@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Wed, Oct 29, 2025 at 1:41 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Mon, Oct 13, 2025 at 7:15 PM Sutou Kouhei <kou(at)clear-code(dot)com> wrote:
> >
> > Hi,
> >
> > In <CAD21AoBkA=g=PN17r_iieru+vLyLtGZ8WvohgANa2vzsMfMogQ(at)mail(dot)gmail(dot)com>
> > "Re: Make COPY format extendable: Extract COPY TO format implementations" on Mon, 13 Oct 2025 14:40:31 -0700,
> > Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > > The patch refactors the CopyToStateData so that we can both hide
> > > internal-use-only fields from extensions and extension can use its own
> > > state data, while not adding extra indirection layers. TBH I'm really
> > > not sure we must fully hide internal fields from extensions. Other
> > > extendable components seem not to strictly hide internal information
> > > from extensions. I'd suggest starting with only the latter point. That
> > > is, we merge fields in CopyToStateInternalData to CopyToStateData.
> > > What do you think?
> >
> > OK. Let's follow the existing style. How about the attached
> > patch? It merges CopyToStateInternalData to CopyToStateData.
> >
>
> The basic idea of this patch makes sense to me.
>
This thread has involved extensive discussion, and the patch needs to
be rebased. I'd like to summarize the current status of this patch and
our discussions. I've attached updated patches that implement the
whole ideas of this feature to help provide a clearer overall picture.
In commits 2e4127b6d and 7717f6300, we refactored COPY TO/FROM code to
use a set of callbacks for format-specific operations like data
parsing. These callbacks are currently not exposed and are only used
by built-in formats. The next step is to allow extensions to register
their own format implementations, which these attached patches
accomplish. The registration API can be called in _PG_init() as
follows to enable users to specify a custom format name ('jsonlines'
in this example) in the FORMAT option:
RegisterCopyCustomFormat("jsonlines",
&JsonLinesCopyFromRoutine,
&JsonLinesCopyToRoutine);
However, before introducing the registration API, we need to resolve
an issue with how we currently use a monolithic struct
(Copy{From,To}StateData) to store COPY TO/FROM state data. This struct
currently contains both format-agnostic fields (e.g., target relation
and source file) and format-specific fields (e.g., input buffers and
EOL type). Patches 0001 and 0002 reorganize these fields to separate
them. Specifically, format-specific fields are moved to a new struct
while Copy{From,To}StateData retains format-agnostic fields, as shown
here (e.g., COPY TO case):
typedef struct CopyToStateData
{
/* format-specific routines */
const struct CopyToRoutine *routine;
/* low-level state data */
CopyDest copy_dest; /* type of copy source/destination */
FILE *copy_file; /* used if copy_dest == COPY_DEST_FILE */
StringInfo fe_msgbuf; /* used for all dests during COPY TO */
(snip)
/*
* Working state
*/
MemoryContext copycontext; /* per-copy execution context */
FmgrInfo *out_functions; /* lookup info for output functions */
MemoryContext rowcontext; /* per-row evaluation context */
uint64 bytes_processed; /* number of bytes processed so far */
} CopyToStateData;
typedef struct CopyToStateTextLike
{
CopyToStateData base; /* embedded */
int file_encoding;
bool need_transcoding;
bool encoding_embeds_ascii;
} CopyToStateTextLike;
Extensions must specify their required state struct size (like
CopyToStateTextLike for built-in formats) using new callbacks
Copy{From,To}EstimateStateSpace, allowing the core to allocate the
appropriate amount of memory. This approach offers two advantages:
- Format processing implementations only use the memory they need
- No additional pointer traversal compared to using an opaque pointer
for format-specific data
Patches 0003 through 0006 implement the following:
0003: Introduces the RegisterCustomCopyFormat() API
0004 and 0005: Enable custom format implementations to register their
own COPY command options
0006: Adds regression tests
With these patches, here's what we can do using the 'jsonlines'
format extension:
-- COPY TO
CREATE TABLE jl (id int, a text, b jsonb);
INSERT INTO jl VALUES (1, 'hello', '{"test" : [1, true, {"num" :
42}]}'::jsonb), (2, 'hello world', 'true'), (999, null, '{"a" : 1}');
TABLE jl;
id | a | b
-----+-------------+----------------------------------
1 | hello | {"test": [1, true, {"num": 42}]}
2 | hello world | true
999 | | {"a": 1}
(3 rows)
COPY jl TO '/tmp/test.jsonl' WITH (format 'jsonlines');
\! cat /tmp/test.jsonl
{"id":1,"a":"hello","b":{"test": [1, true, {"num": 42}]}}
{"id":2,"a":"hello world","b":true}
{"id":999,"a":null,"b":{"a": 1}}
-- COPY FROM
CREATE TABLE jl_load (id int, a text, b jsonb);
COPY jl_load FROM '/tmp/test.jsonl' WITH (format 'jsonlines');
-- COPY TO/FROM with custom options
COPY jl TO '/tmp/jl.jsonl.gz' WITH (format 'jsonlines', compression
'gzip', compression_detail 'level=2');
COPY jl FROM '/tmp/jl.jsonl.gz' WITH (format 'jsonlines');
To demonstrate the functionality of both current and new APIs,
Suto-san and I have created several experimental custom COPY format
extensions:
Apache Arrow (developed by Sutou-san): https://github.com/kou/pg-copy-arrow
JSON Lines (developed by me):
https://github.com/MasahikoSawada/pg_custom_copy_formats/blob/master/jsonlines.c
These implementations serve as good examples of how extensions can use
these APIs to define custom COPY formats.
After offline discussions with Sutou-san, we believe the current APIs
work well, particularly for text-based formats, though we still need
to verify there are no performance regressions.
One potential improvement would be adding support for random file
access in COPY FROM operations. For example, with parquet files, it
would be much more efficient to read the footer section first since it
contains metadata, allowing selective reading of necessary file
sections. The current sequential read API (CopyFromGetData()) requires
reading all data to access the metadata.
For future consideration, we could look into supporting file
reading/writing from external sources like S3. While this is outside
the scope of this patch, we discussed that allowing the core to
delegate I/O operations to custom format implementations might be a
good starting point. We can discuss this in a separate thread.
I welcome your feedback on these proposed changes and APIs to help
move this patch forward.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
| Attachment | Content-Type | Size |
|---|---|---|
| 0003-Support-custom-COPY-format-for-COPY-FROM-and-COPY-TO.patch | application/octet-stream | 9.6 KB |
| 0006-Add-regression-test-module-for-custom-COPY-format.patch | application/octet-stream | 17.0 KB |
| 0004-Refactor-Copy-option-processing.patch | application/octet-stream | 6.9 KB |
| 0005-Add-Copy-To-From-ProcessOneOption-callback.patch | application/octet-stream | 5.9 KB |
| 0002-Separate-format-specific-fields-from-CopyFromStateDa.patch | application/octet-stream | 55.2 KB |
| 0001-Separate-format-specific-fields-from-CopyToStateData.patch | application/octet-stream | 21.5 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Tom Lane | 2025-11-14 20:25:15 | Re: postgresql.conf.sample tab width |
| Previous Message | Nathan Bossart | 2025-11-14 20:13:04 | postgresql.conf.sample tab width |