Re: COPY FROM performance improvements

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
Cc: "Alon Goldshuv" <agoldshuv(at)greenplum(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: COPY FROM performance improvements
Date: 2005-08-07 04:08:16
Message-ID: 29007.1123387696@sss.pgh.pa.us
Lists: pgsql-patches pgsql-performance

"Luke Lonergan" <llonergan(at)greenplum(dot)com> writes:
>> I had some difficulty in generating test cases that weren't largely
>> I/O-bound, but AFAICT the patch as applied is about the same speed
>> as what you submitted.

> You achieve the important objective of knocking the parsing stage down a
> lot, but your parsing code is actually about 20% slower than Alon's.

I would like to see the exact test case you are using to make this
claim; the tests I did suggested my code is the same speed or faster.
The particular test case I was using was the "tenk1" data from the
regression database, duplicated out to about 600K rows so as to run
long enough to measure with some degree of repeatability.

As best I can tell, my version of CopyReadAttributes is significantly
quicker than Alon's, approximately balancing out the fact that my
version of CopyReadLine is slower. I did the latter first, and would
now be tempted to rewrite it in the same style as CopyReadAttributes,
i.e., one pass of memory-to-memory copy using pointers rather than buffer
indexes.
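
To illustrate what "one pass of memory-to-memory copy using pointers" can look like, here is a minimal sketch. It is not the code from either patch: the function name split_fields, its signature, and the simplified backslash handling are all assumptions made for illustration. It walks the input line once with a read pointer and a write pointer, copying bytes into an output buffer and recording where each field starts, with no buffer-index arithmetic.

```c
#include <stddef.h>

/*
 * Hypothetical sketch, NOT PostgreSQL's CopyReadAttributes.
 *
 * Splits one NUL-terminated input line into fields in a single pass,
 * copying bytes from `line` to `out` through two moving pointers.
 * `out` must have room for strlen(line) + 1 bytes.
 * Stores at most max_fields start pointers in fields[] and returns
 * the total number of fields found in the line.
 */
int
split_fields(const char *line, char delim, char *out,
             char **fields, int max_fields)
{
    const char *src = line;     /* read cursor into the input line */
    char       *dst = out;      /* write cursor into the output buffer */
    int         nfields = 0;

    for (;;)
    {
        /* Remember where this field starts in the output buffer. */
        if (nfields < max_fields)
            fields[nfields] = dst;
        nfields++;

        /* Copy bytes until the delimiter or the end of the line. */
        while (*src != '\0' && *src != delim)
        {
            /* Simplified escaping: a backslash makes the next byte literal. */
            if (*src == '\\' && src[1] != '\0')
                src++;
            *dst++ = *src++;
        }
        *dst++ = '\0';          /* terminate the field just copied */

        if (*src == '\0')
            break;              /* end of line: no more fields */
        src++;                  /* step over the delimiter */
    }
    return nfields;
}
```

Calling split_fields("a\tbc", '\t', out, fields, 4) with a sufficiently large out buffer leaves fields[0] pointing at "a" and fields[1] at "bc". Each input byte is read once and written at most once, which is the property the pointer style is meant to provide.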

BTW, late today I figured out a way to get fairly reproducible
non-I/O-bound numbers about COPY FROM: use a trigger that suppresses
the actual inserts, thus:

    create table foo ...
    create function noway() returns trigger as
    'begin return null; end' language plpgsql;
    create trigger noway before insert on foo
    for each row execute procedure noway();

then repeat:

    copy foo from '/tmp/foo.data';

If the source file is not too large to fit in kernel disk cache, then
after the first iteration there is no I/O at all. I got numbers
that were reproducible within less than 1%, as opposed to 5% or more
variation when the thing was partially I/O bound. Pretty useless in the
real world, of course, but great for timing COPY's data-pushing.

regards, tom lane
