COPY fast parse patch

From: "Alon Goldshuv" <agoldshuv(at)greenplum(dot)com>
To: pgsql-patches(at)postgresql(dot)org
Subject: COPY fast parse patch
Date: 2005-06-01 23:34:37
Message-ID: BEC3941E.5144%agoldshuv@greenplum.com
Lists: pgsql-patches

Here is the patch I mentioned in my message to the "NOLOGGING option, or ?"
thread. I would like to point out a few important things about it:

1) The patch includes 2 parallel parsing code paths. One is the regular COPY
path that we all know, and the other is the improved one that I wrote. This
is only temporary, as there is a lot of code duplication, but I left it this
way for several reasons:

- The improved path for now supports only ASCII delimited text format, and
only when the client and server encodings are identical. If this is not the
case when running COPY, the old path is used instead. In other words, a
"not supported" error will never be raised.

- Having both code paths allows for easy performance comparison between the
two. To run the regular COPY parsing, call CopyFrom() from DoCopy(); to run
the improved parsing, call FastCopyFrom() from DoCopy(). Right now
FastCopyFrom() is called automatically if all the conditions explained in
the previous point are met, but that is easy to change (it requires a
re-compile).

* NOTE: the Fast*() function names, ugly as they are, are there to
emphasize the differences between the old and the improved path (e.g.
CopyReadLine() vs. FastReadLine()). They are not intended to stay this
way. This is not elegant (yet)!

2) There are some utilities, such as bytebuffer and strchrlen, at the bottom
of the file. This is probably not the right home for them, but for now, to
simplify things, they are included in copy.c.

3) EOL is assumed to be NL. I raised a point about EOLs in COPY in my
previous thread, which explains this.

4) Performance numbers can be viewed at
http://lists.pgfoundry.org/pipermail/bizgres-general/2005-May/000135.html
Some numbers include:
8.7MB/sec -> 11.8MB/sec on 15 column (Mixed) table.
12.1MB/sec -> 21MB/sec on 1 column (TEXT) table.

5) Data integrity and escaping improvements. The improved path treats every
character as data (unless it is an escaped delimiter or EOL), so data
integrity is preserved. However, some people who are already used to the
postgres COPY escaping rules may want to keep them. They can do so by still
using the old COPY.
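
To make points 1 and 5 a bit more concrete, here is a minimal standalone
sketch of the two ideas: the check that decides whether the improved path
applies, and the "everything is data unless it is an escaped delimiter or
EOL" rule. This is NOT the code from the patch. The struct, its fields and
both function names are made up for illustration, and treating CSV as
excluded from the improved path is my assumption.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical option set; the names are illustrative, not from copy.c. */
typedef struct CopyOptsSketch
{
    bool is_binary;        /* COPY ... BINARY */
    bool is_csv;           /* COPY ... CSV (assumed to use the old path) */
    int  client_encoding;  /* encoding id on the client side */
    int  server_encoding;  /* encoding id on the server side */
} CopyOptsSketch;

/*
 * Point 1: the improved path is taken only for delimited text with
 * identical client and server encodings. Otherwise the caller silently
 * falls back to the old path; no error is raised.
 */
static bool
use_fast_path(const CopyOptsSketch *opts)
{
    return !opts->is_binary &&
           !opts->is_csv &&
           opts->client_encoding == opts->server_encoding;
}

/*
 * Point 5: split one line (trailing NL already removed) into fields,
 * in place. A backslash followed by the delimiter or by NL yields that
 * character as data. Every other byte, including a backslash followed
 * by anything else, is copied through unchanged.
 *
 * Returns the number of fields, or -1 if there are more than max_fields.
 */
static int
split_fields_sketch(char *line, char delim, char **fields, int max_fields)
{
    char *src = line;   /* read position */
    char *dst = line;   /* write position, never ahead of src */
    int   nfields = 0;

    if (max_fields < 1)
        return -1;
    fields[nfields++] = dst;

    while (*src != '\0')
    {
        if (*src == '\\' && (src[1] == delim || src[1] == '\n'))
        {
            /* escaped delimiter or EOL: keep the character, drop the '\' */
            *dst++ = src[1];
            src += 2;
        }
        else if (*src == delim)
        {
            /* unescaped delimiter: terminate this field, start the next */
            *dst++ = '\0';
            src++;
            if (nfields == max_fields)
                return -1;
            fields[nfields++] = dst;
        }
        else
        {
            /* plain data, copied as is */
            *dst++ = *src++;
        }
    }
    *dst = '\0';
    return nfields;
}
```

With '|' as the delimiter, a line holding a, backslash, '|', b, '|', c,
backslash, d, '|' comes out as three fields: "a|b", then c-backslash-d with
the backslash kept as data, then an empty string.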

As a part of submitting this patch I also presented an argument for a LOAD
DATA command (in the NOLOGGING option thread). The points I made there are
closely related to this message. There may be a valid argument that most of
the points I raised could be implemented in the COPY code instead of in a
LOAD DATA command, but that would require a great deal of flexibility in
adding features and extending the COPY syntax. That may not be a good idea
for some, and it would also be problematic for backwards compatibility.

Thx,
Alon.

Attachment Content-Type Size
fast_copy_patch_alon.patch application/octet-stream 41.4 KB
