Re: COPY FROM performance improvements

From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Alon Goldshuv" <agoldshuv(at)greenplum(dot)com>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: COPY FROM performance improvements
Date: 2005-08-07 02:33:07
Message-ID: BF1AC0F3.AA43%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-patches pgsql-performance

Tom,

Thanks for finding the bugs and reworking things.

> I had some difficulty in generating test cases that weren't largely
> I/O-bound, but AFAICT the patch as applied is about the same speed
> as what you submitted.

You achieve the important objective of knocking the parsing stage down a
lot, but your parsing code is actually about 20% slower than Alon's.

Before your patch:
Time: 14205.606 ms

With your patch:
Time: 10565.374 ms

With Alon's patch:
Time: 10289.845 ms

The parsing part of the code in your version is slower, but as a percentage
of the total it's hidden. The loss of 0.3 seconds on 143MB means:

- If parsing takes a total of 0.9 seconds, the parsing rate is 160MB/s
(143/0.9)

- If we add another 0.3 seconds to parsing to bring it to 1.2, then the
parsing rate becomes 120MB/s

When we improve the next stages of the processing (attribute conversion,
write-to disk), this will stand out a lot more. Our objective is to get the
COPY rate *much* faster than the current poky rate of 14MB/s (after this
patch).

- Luke

On 8/6/05 2:04 PM, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> "Alon Goldshuv" <agoldshuv(at)greenplum(dot)com> writes:
>> New patch attached. It includes very minor changes. These are changes that
>> were committed to CVS 3 weeks ago (copy.c 1.247) which I missed in the
>> previous patch.
>
> I've applied this with (rather extensive) revisions. I didn't like what
> you had done with the control structure --- loading the input buffer
> only at the outermost loop level was a bad design choice IMHO. You had
> sprinkled the code with an unreasonable number of special cases in order
> to try to cope with the effects of that mistake, but there were lots
> of problems still left. Some of the bugs I noticed:
>
> * Broke old-protocol COPY, since that has no provision for stopping at
> the EOF marker except by parsing the data carefully to start with. The
> backend would just hang up unless the total data size chanced to be a
> multiple of 64K.
>
> * Subtle change in interpretation of \. EOF marker (the existing code
> will recognize it even when not at start of line).
>
> * Seems to have thrown away detection of newline format discrepancies.
>
> * Fails for zero-column tables.
>
> * Broke display of column values during error context callback (would
> always show the first column contents no matter which one is being
> complained of).
>
> * DetectLineEnd mistakenly assumes CR mode if very last character of first
> bufferload is CR; need to reserve judgment until next char is available.
>
> * DetectLineEnd fails to account for backslashed control characters,
> so it will e.g. accept \ followed by \n as determining the newline
> style.
>
> * Fails to apply encoding conversion if first line exceeds copy buf
> size, because when DetectLineEnd fails the quick-exit path doesn't do
> it.
>
> * There seem to be several bugs associated with the fact that input_buf[]
> always has 64K of data in it even when EOF has been reached on the
> input. One example:
> echo -n 123 >zzz1
> psql> create temp table t1(f1 text);
> psql> copy t1 from '/home/tgl/zzz1';
> psql> select * from t1;
> hmm ... where'd that 64K of whitespace come from?
>
> I rewrote the patch in a way that retained most of the speedups without
> changing the basic control structure (except for replacing multiple
> CopyReadAttribute calls with one CopyReadAttributes call per line).
>
> I had some difficulty in generating test cases that weren't largely
> I/O-bound, but AFAICT the patch as applied is about the same speed
> as what you submitted.
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster
>

In response to

Responses

Browse pgsql-patches by date

  From Date Subject
Next Message Luke Lonergan 2005-08-07 02:54:04 Re: COPY FROM performance improvements
Previous Message Tom Lane 2005-08-06 21:04:52 Re: COPY FROM performance improvements

Browse pgsql-performance by date

  From Date Subject
Next Message Luke Lonergan 2005-08-07 02:54:04 Re: COPY FROM performance improvements
Previous Message Tom Lane 2005-08-06 21:04:52 Re: COPY FROM performance improvements