Re: Make COPY format extendable: Extract COPY TO format implementations

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Sutou Kouhei <kou(at)clear-code(dot)com>
Cc: andrew(at)dunslane(dot)net, michael(at)paquier(dot)xyz, zhjwpku(at)gmail(dot)com, nathandbossart(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Make COPY format extendable: Extract COPY TO format implementations
Date: 2024-01-25 04:36:03
Message-ID: CAD21AoALxEZz33NpcSk99ad_DT3A2oFNMa2KNjGBCMVFeCiUaA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jan 24, 2024 at 11:17 PM Sutou Kouhei <kou(at)clear-code(dot)com> wrote:
>
> Hi,
>
> In <10025bac-158c-ffe7-fbec-32b42629121f(at)dunslane(dot)net>
> "Re: Make COPY format extendable: Extract COPY TO format implementations" on Wed, 24 Jan 2024 07:15:55 -0500,
> Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
>
> >
> > On 2024-01-24 We 03:11, Michael Paquier wrote:
> >> On Wed, Jan 24, 2024 at 02:49:36PM +0900, Sutou Kouhei wrote:
> >>> For COPY TO:
> >>>
> >>> 0001: This adds CopyToRoutine and use it for text/csv/binary
> >>> formats. No implementation change. This just move codes.
> >> 10M without this change:
> >>
> >> format,elapsed time (ms)
> >> text,1090.763
> >> csv,1136.103
> >> binary,1137.141
> >>
> >> 10M with this change:
> >>
> >> format,elapsed time (ms)
> >> text,1082.654
> >> csv,1196.991
> >> binary,1069.697
> >>
> >> These numbers point out that binary is faster by 6%, csv is slower by
> >> 5%, while text stays around what looks like noise range. That's not
> >> negligible. Are these numbers reproducible? If they are, that could
> >> be a problem for anybody doing bulk-loading of large data sets. I am
> >> not sure to understand where the improvement for binary comes from by
> >> reading the patch, but perhaps perf would tell more for each format?
> >> The loss with csv could be blamed on the extra manipulations of the
> >> function pointers, likely.
> >
> >
> > I don't think that's at all acceptable.
> >
> > We've spent quite a lot of blood sweat and tears over the years to make COPY
> > fast, and we should not sacrifice any of that lightly.
>
> These numbers aren't reproducible. Because these benchmarks
> executed on my normal machine not a machine only for
> benchmarking. The machine runs another processes such as
> editor and Web browser.
>
> For example, here are some results with master
> (94edfe250c6a200d2067b0debfe00b4122e9b11e):
>
> Format,N records,Elapsed time (ms)
> csv,10000000,1073.715
> csv,10000000,1022.830
> csv,10000000,1073.584
> csv,10000000,1090.651
> csv,10000000,1052.259
>
> Here are some results with master + the 0001 patch:
>
> Format,N records,Elapsed time (ms)
> csv,10000000,1025.356
> csv,10000000,1067.202
> csv,10000000,1014.563
> csv,10000000,1032.088
> csv,10000000,1058.110
>
>
> I uploaded my benchmark script so that you can run the same
> benchmark on your machine:
>
> https://gist.github.com/kou/be02e02e5072c91969469dbf137b5de5
>
> Could anyone try the benchmark with master and master+0001?
>

I've run a similar scenario:

create unlogged table test (a int);
insert into test select c from generate_series(1, 25000000) c;
copy test to '/tmp/result.csv' with (format csv); -- generates 230MB file

I've run it on HEAD and HEAD+0001 patch and here are the medians of 10
executions for each format:

HEAD:
binary 2930.353 ms
text 2754.852 ms
csv 2890.012 ms

HEAD w/ 0001 patch:
binary 2814.838 ms
text 2900.845 ms
csv 3015.210 ms

Hmm I can see a similar trend that Suto-san had; the binary format got
slightly faster whereas both text and csv format has small regression
(4%~5%). I think that the improvement for binary came from the fact
that we removed "if (cstate->opts.binary)" branches from the original
CopyOneRowTo(). I've experimented with a similar optimization for csv
and text format; have different callbacks for text and csv format and
remove "if (cstate->opts.csv_mode)" branches. I've attached a patch
for that. Here are results:

HEAD w/ 0001 patch + remove branches:
binary 2824.502 ms
text 2715.264 ms
csv 2803.381 ms

The numbers look better now. I'm not sure these are within a noise
range but it might be worth considering having different callbacks for
text and csv formats.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
add_callback_for_csv_format.patch application/octet-stream 3.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nikolay Samokhvalov 2024-01-25 04:40:30 Re: UUID v7
Previous Message Zhijie Hou (Fujitsu) 2024-01-25 04:24:58 RE: Synchronizing slots from primary to standby