Re: COPY FROM performance improvements

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Alon Goldshuv" <agoldshuv(at)greenplum(dot)com>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: COPY FROM performance improvements
Date: 2005-08-06 21:04:52
Message-ID: 26467.1123362292@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-patches pgsql-performance

"Alon Goldshuv" <agoldshuv(at)greenplum(dot)com> writes:
> New patch attached. It includes very minor changes. These are changes that
> were committed to CVS 3 weeks ago (copy.c 1.247) which I missed in the
> previous patch.

I've applied this with (rather extensive) revisions. I didn't like what
you had done with the control structure --- loading the input buffer
only at the outermost loop level was a bad design choice IMHO. You had
sprinkled the code with an unreasonable number of special cases in order
to try to cope with the effects of that mistake, but there were lots
of problems still left. Some of the bugs I noticed:

* Broke old-protocol COPY, since that has no provision for stopping at
the EOF marker except by parsing the data carefully to start with. The
backend would just hang up unless the total data size chanced to be a
multiple of 64K.

* Subtle change in interpretation of \. EOF marker (the existing code
will recognize it even when not at start of line).

* Seems to have thrown away detection of newline format discrepancies.

* Fails for zero-column tables.

* Broke display of column values during error context callback (would
always show the first column contents no matter which one is being
complained of).

* DetectLineEnd mistakenly assumes CR mode if very last character of first
bufferload is CR; need to reserve judgment until next char is available.

* DetectLineEnd fails to account for backslashed control characters,
so it will e.g. accept \ followed by \n as determining the newline
style.

* Fails to apply encoding conversion if first line exceeds copy buf
size, because when DetectLineEnd fails the quick-exit path doesn't do
it.

* There seem to be several bugs associated with the fact that input_buf[]
always has 64K of data in it even when EOF has been reached on the
input. One example:
echo -n 123 >zzz1
psql> create temp table t1(f1 text);
psql> copy t1 from '/home/tgl/zzz1';
psql> select * from t1;
hmm ... where'd that 64K of whitespace come from?

I rewrote the patch in a way that retained most of the speedups without
changing the basic control structure (except for replacing multiple
CopyReadAttribute calls with one CopyReadAttributes call per line).

I had some difficulty in generating test cases that weren't largely
I/O-bound, but AFAICT the patch as applied is about the same speed
as what you submitted.

regards, tom lane

In response to

Responses

Browse pgsql-patches by date

  From Date Subject
Next Message Luke Lonergan 2005-08-07 02:33:07 Re: COPY FROM performance improvements
Previous Message Mark Wong 2005-08-06 21:04:19 Re: [HACKERS] O_DIRECT for WAL writes

Browse pgsql-performance by date

  From Date Subject
Next Message Luke Lonergan 2005-08-07 02:33:07 Re: COPY FROM performance improvements
Previous Message Tom Lane 2005-08-06 14:12:43 Re: Slow update statement