Re: COPY FROM performance improvements

From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Alon Goldshuv" <agoldshuv(at)greenplum(dot)com>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: COPY FROM performance improvements
Date: 2005-08-07 02:54:04
Message-ID: BF1AC5DC.AA48%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-patches pgsql-performance

Tom,

The previous timings were for a table with 15 columns of mixed type. We
also test with 1 column to make the parsing overhead more apparent. In the
case of 1 text column with 145MB of input data:

Your patch:
Time: 6612.599 ms

Alon's patch:
Time: 6119.244 ms

Alon's patch is 7.5% faster here, where it was only 3% faster on the 15
column case. This is consistent with a large difference in parsing speed
between your approach and Alon's.

I'm pretty sure that the "mistake" you refer to is responsible for the speed
improvement, and was deliberately chosen to minimize memory copies, etc.
Given that we're looking ahead to getting much higher speeds, approaching
current high performance disk speeds, we've been looking more closely at the
parsing speed. It comes down to a tradeoff between elegant code and speed.

We'll prove it in lab tests soon, where we measure the parsing rate
directly, but these experiments show it clearly, though indirectly.

- Luke

On 8/6/05 2:04 PM, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> "Alon Goldshuv" <agoldshuv(at)greenplum(dot)com> writes:
>> New patch attached. It includes very minor changes. These are changes that
>> were committed to CVS 3 weeks ago (copy.c 1.247) which I missed in the
>> previous patch.
>
> I've applied this with (rather extensive) revisions. I didn't like what
> you had done with the control structure --- loading the input buffer
> only at the outermost loop level was a bad design choice IMHO. You had
> sprinkled the code with an unreasonable number of special cases in order
> to try to cope with the effects of that mistake, but there were lots
> of problems still left. Some of the bugs I noticed:
>
> * Broke old-protocol COPY, since that has no provision for stopping at
> the EOF marker except by parsing the data carefully to start with. The
> backend would just hang up unless the total data size chanced to be a
> multiple of 64K.
>
> * Subtle change in interpretation of \. EOF marker (the existing code
> will recognize it even when not at start of line).
>
> * Seems to have thrown away detection of newline format discrepancies.
>
> * Fails for zero-column tables.
>
> * Broke display of column values during error context callback (would
> always show the first column contents no matter which one is being
> complained of).
>
> * DetectLineEnd mistakenly assumes CR mode if very last character of first
> bufferload is CR; need to reserve judgment until next char is available.
>
> * DetectLineEnd fails to account for backslashed control characters,
> so it will e.g. accept \ followed by \n as determining the newline
> style.
>
> * Fails to apply encoding conversion if first line exceeds copy buf
> size, because when DetectLineEnd fails the quick-exit path doesn't do
> it.
>
> * There seem to be several bugs associated with the fact that input_buf[]
> always has 64K of data in it even when EOF has been reached on the
> input. One example:
> echo -n 123 >zzz1
> psql> create temp table t1(f1 text);
> psql> copy t1 from '/home/tgl/zzz1';
> psql> select * from t1;
> hmm ... where'd that 64K of whitespace come from?
>
> I rewrote the patch in a way that retained most of the speedups without
> changing the basic control structure (except for replacing multiple
> CopyReadAttribute calls with one CopyReadAttributes call per line).
>
> I had some difficulty in generating test cases that weren't largely
> I/O-bound, but AFAICT the patch as applied is about the same speed
> as what you submitted.
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster
>

In response to

Browse pgsql-patches by date

  From Date Subject
Next Message Luke Lonergan 2005-08-07 03:25:24 Re: COPY FROM performance improvements
Previous Message Luke Lonergan 2005-08-07 02:33:07 Re: COPY FROM performance improvements

Browse pgsql-performance by date

  From Date Subject
Next Message Luke Lonergan 2005-08-07 03:25:24 Re: COPY FROM performance improvements
Previous Message Luke Lonergan 2005-08-07 02:33:07 Re: COPY FROM performance improvements