Make COPY format extendable: Extract COPY TO format implementations

From: Sutou Kouhei <kou(at)clear-code(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Make COPY format extendable: Extract COPY TO format implementations
Date: 2023-12-04 06:35:48
Message-ID: 20231204.153548.2126325458835528809.kou@clear-code.com
Lists: pgsql-hackers

Hi,

I want to work on making the COPY format extendable. I've
attached the first patch for it. I'll send more patches
after this is merged.

Background:

Currently, COPY TO/FROM supports only "text", "csv" and
"binary" formats. There are some requests to support more
COPY formats. For example:

* 2023-11: JSON and JSON lines [1]
* 2022-04: Apache Arrow [2]
* 2018-02: Apache Avro, Apache Parquet and Apache ORC [3]

(FYI: I want to add support for Apache Arrow.)

There were discussions about how to add support for more
formats. [3][4] In these discussions, we reached a consensus
on making the COPY format extendable.

But it seems that nobody is working on this yet, so I want
to work on it. (If anyone wants to work on this together,
I'd be happy to.)

Summary:

The attached patch introduces a CopyToFormatOps struct. It is
similar to TupleTableSlotOps for TupleTableSlot, but
CopyToFormatOps is for COPY TO formats. CopyToFormatOps holds
the routines that implement a COPY TO format.
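
To illustrate the idea only, here is a minimal standalone sketch of
a per-format table of callbacks. It is not the code in the attached
patch: every name in it (DemoCopyToFormatOps, DemoCopyToState,
demo_copy_to() and so on) is hypothetical, and the three callbacks
are an assumption about what such a struct might contain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified stand-ins; not the structs from the patch. */
typedef struct DemoCopyToState DemoCopyToState;

/* One table of callbacks per output format. */
typedef struct DemoCopyToFormatOps
{
	void		(*start) (DemoCopyToState *state);
	void		(*one_row) (DemoCopyToState *state, const int *values, int nvalues);
	void		(*end) (DemoCopyToState *state);
} DemoCopyToFormatOps;

struct DemoCopyToState
{
	const DemoCopyToFormatOps *ops;	/* selected format */
	char		out[256];			/* output buffer (NUL-terminated) */
	size_t		len;				/* bytes used in out */
};

/* Append a string to the output buffer. */
static void
demo_append(DemoCopyToState *state, const char *s)
{
	size_t		n = strlen(s);

	assert(state->len + n < sizeof(state->out));
	memcpy(state->out + state->len, s, n + 1);
	state->len += n;
}

/* Shared helper: emit one row of integers joined by delim. */
static void
demo_send_row(DemoCopyToState *state, const int *values, int nvalues,
			  const char *delim)
{
	char		buf[32];

	for (int i = 0; i < nvalues; i++)
	{
		if (i > 0)
			demo_append(state, delim);
		snprintf(buf, sizeof(buf), "%d", values[i]);
		demo_append(state, buf);
	}
	demo_append(state, "\n");
}

static void
demo_noop(DemoCopyToState *state)
{
	(void) state;				/* nothing to do for these formats */
}

static void
demo_text_one_row(DemoCopyToState *state, const int *values, int nvalues)
{
	demo_send_row(state, values, nvalues, "\t");
}

static void
demo_csv_one_row(DemoCopyToState *state, const int *values, int nvalues)
{
	demo_send_row(state, values, nvalues, ",");
}

static const DemoCopyToFormatOps demo_text_ops = {demo_noop, demo_text_one_row, demo_noop};
static const DemoCopyToFormatOps demo_csv_ops = {demo_noop, demo_csv_one_row, demo_noop};

/* Format-independent driver: it only ever calls through state->ops. */
static const char *
demo_copy_to(DemoCopyToState *state, const DemoCopyToFormatOps *ops)
{
	static const int rows[2][2] = {{1, 2}, {3, 4}};

	state->ops = ops;
	state->len = 0;
	state->out[0] = '\0';

	state->ops->start(state);
	for (int r = 0; r < 2; r++)
		state->ops->one_row(state, rows[r], 2);
	state->ops->end(state);
	return state->out;
}
```

The point of the sketch is that demo_copy_to() never checks which
format is in use. Supporting another format means supplying another
ops table.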

The attached patch doesn't change:

* the current behavior (all existing tests still pass
without changes)
* the existing "text", "csv" and "binary" format output
implementations, including local variable names (the
attached patch just moves them and adjusts indentation)
* performance (no significant loss of performance)

In other words, this is just a refactoring for further
changes to make the COPY format extendable. If I used a
"complete the task and then request reviews for it"
approach, it would be difficult to review because the
changes would be large. So I want to work on this step by
step. Is that acceptable?

TODOs that should be done in subsequent patches:

* Add some CopyToState readers such as CopyToStateGetDest(),
CopyToStateGetAttnums() and CopyToStateGetOpts()
(We will need to consider which APIs should be exported.)
(This is for implementing a COPY TO format in an extension.)
* Export CopySend*() in src/backend/commands/copyto.c
(This is for implementing a COPY TO format in an extension.)
* Add API to register a new COPY TO format implementation
* Add "CREATE XXX" to register a new COPY TO format (or COPY
TO/FROM format) implementation
("CREATE COPY HANDLER" was suggested in [5].)
* Same for COPY FROM
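
As a rough illustration of the "API to register a new COPY TO
format implementation" item, a name-to-implementation registry
could look like the standalone sketch below. This is not a proposed
design: all names (DemoFormatOps, demo_register_format(),
demo_lookup_format()) are hypothetical, and the fixed-size array is
only an assumption made to keep the example short.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_FORMATS 8

/* Hypothetical placeholder for a per-format callback table. */
typedef struct DemoFormatOps
{
	const char *description;
} DemoFormatOps;

typedef struct DemoFormatEntry
{
	const char *name;			/* format name, e.g. "arrow" */
	const DemoFormatOps *ops;	/* implementation registered for it */
} DemoFormatEntry;

static DemoFormatEntry demo_formats[DEMO_MAX_FORMATS];
static int	demo_n_formats = 0;

/* Return the implementation registered under name, or NULL if none. */
static const DemoFormatOps *
demo_lookup_format(const char *name)
{
	for (int i = 0; i < demo_n_formats; i++)
	{
		if (strcmp(demo_formats[i].name, name) == 0)
			return demo_formats[i].ops;
	}
	return NULL;
}

/*
 * Register ops under name.  Returns 1 on success, or 0 if the table
 * is full or the name is already taken.
 */
static int
demo_register_format(const char *name, const DemoFormatOps *ops)
{
	if (demo_n_formats >= DEMO_MAX_FORMATS)
		return 0;
	if (demo_lookup_format(name) != NULL)
		return 0;

	demo_formats[demo_n_formats].name = name;
	demo_formats[demo_n_formats].ops = ops;
	demo_n_formats++;
	return 1;
}
```

With something like this, an extension would call the register
function once at load time, and COPY would look up the
implementation by the format name given in the command. A
"CREATE XXX" command would presumably replace the in-memory table
with a catalog lookup.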

Performance:

We reached a consensus on making the COPY format extendable,
but we need to be careful about performance. [6]

> I think that step 1 ought to be to convert the existing
> formats into plug-ins, and demonstrate that there's no
> significant loss of performance.

So I measured COPY TO time with/without this change. You can
see there is no significant loss of performance.

Data: random 32-bit integers:

CREATE TABLE data (int32 integer);
INSERT INTO data
SELECT random() * 10000
FROM generate_series(1, ${n_records});

The number of records: 100K, 1M and 10M

100K without this change:

format,elapsed time (ms)
text,22.527
csv,23.822
binary,24.806

100K with this change:

format,elapsed time (ms)
text,22.919
csv,24.643
binary,24.705

1M without this change:

format,elapsed time (ms)
text,223.457
csv,233.583
binary,242.687

1M with this change:

format,elapsed time (ms)
text,224.591
csv,233.964
binary,247.164

10M without this change:

format,elapsed time (ms)
text,2330.383
csv,2411.394
binary,2590.817

10M with this change:

format,elapsed time (ms)
text,2231.307
csv,2408.067
binary,2473.617
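
To make the comparison easier to read, the relative change for each
pair of timings can be computed as (after - before) / before. The
small helper below is only a convenience sketch and is not part of
the patch. The function names are made up here, and the timings are
copied from the tables above.

```c
#include <stdio.h>

/* Relative change in percent; negative means faster with the change. */
static double
relative_change_percent(double before_ms, double after_ms)
{
	return (after_ms - before_ms) / before_ms * 100.0;
}

/* Print the relative change for every measurement listed above. */
static void
print_relative_changes(void)
{
	static const struct
	{
		const char *label;
		double		before_ms;
		double		after_ms;
	}			results[] = {
		{"100K text", 22.527, 22.919},
		{"100K csv", 23.822, 24.643},
		{"100K binary", 24.806, 24.705},
		{"1M text", 223.457, 224.591},
		{"1M csv", 233.583, 233.964},
		{"1M binary", 242.687, 247.164},
		{"10M text", 2330.383, 2231.307},
		{"10M csv", 2411.394, 2408.067},
		{"10M binary", 2590.817, 2473.617},
	};

	for (size_t i = 0; i < sizeof(results) / sizeof(results[0]); i++)
		printf("%-12s %+.2f%%\n", results[i].label,
			   relative_change_percent(results[i].before_ms,
									   results[i].after_ms));
}
```

Calling print_relative_changes() prints one signed percentage per
row, which makes it easy to see whether the differences look like
noise or a consistent slowdown.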

[1]: https://www.postgresql.org/message-id/flat/24e3ee88-ec1e-421b-89ae-8a47ee0d2df1%40joeconway.com#a5e6b8829f9a74dfc835f6f29f2e44c5
[2]: https://www.postgresql.org/message-id/flat/CAGrfaBVyfm0wPzXVqm0%3Dh5uArYh9N_ij%2BsVpUtDHqkB%3DVyB3jw%40mail.gmail.com
[3]: https://www.postgresql.org/message-id/flat/20180210151304.fonjztsynewldfba%40gmail.com
[4]: https://www.postgresql.org/message-id/flat/3741749.1655952719%40sss.pgh.pa.us#2bb7af4a3d2c7669f9a49808d777a20d
[5]: https://www.postgresql.org/message-id/20180211211235.5x3jywe5z3lkgcsr%40alap3.anarazel.de
[6]: https://www.postgresql.org/message-id/3741749.1655952719%40sss.pgh.pa.us

Thanks,
--
kou

Attachment Content-Type Size
v1-0001-Extract-COPY-TO-format-implementations.patch text/x-patch 17.2 KB
