Re: COPY FROM performance improvements

From: "Alon Goldshuv" <agoldshuv(at)greenplum(dot)com>
To: "Andrew Dunstan" <andrew(at)dunslane(dot)net>
Cc: llonergan(at)greenplum(dot)com, pgsql-patches(at)postgresql(dot)org
Subject: Re: COPY FROM performance improvements
Date: 2005-06-25 22:52:29
Message-ID: BEE32E3D.616B%agoldshuv@greenplum.com
Lists: pgsql-patches

> I haven't seen any demand for users to be allowed to specify an escape char
> in text mode (BTW, that's what the docs call what you have called delimited
> mode). Maybe there's a case for it, but I somewhat doubt it.

I ran into users who could not load their data using COPY because the data
included a lot of embedded '\\'... These users would like to choose another
char to be the escape, or maybe even to disable escaping altogether.
However, this is probably a separate discussion that we should take up in
the future to see if it's worth introducing.

> As for variable names, choose some variant that seems reasonable. If we need
> to distinguish we should use csv_escapec and text_escapec. But I'm not sure
> we can't use the same variable name for both cases, unless they would have
> conflicting scopes.

Cool. I don't think the scopes are conflicting at the moment.

> What I would like to have is a high level description of
> . how the new text mode code differs from the old text mode code, and
> . which part of the change is responsible for how much performance gain.

Fair enough!

Where the code differs and why it's faster:
-------------------------------------------

The changes in the code only touch the *parsing* parts for *text* format.
All the rest of the COPY process remains the same. More specifically:

CopyFrom() still handles BINARY and CSV, and I extracted TEXT mode into a
new function called CopyFromDelimited() (referring to delimited text
format). This function calls the rest of the modified functions below:

CopyGetData() was modified to return the number of bytes actually read.
Rather than asking for a char at a time, the text parsing code asks for big
blocks of data at a time, and needs to know how much data was accepted into
the input buffer (input_buf) for parsing. CopyFrom() can still call
CopyGetData() and ignore the return value, like it did before.

Why is it faster: instead of calling CopyGetData() for every single byte
(a huge number of function calls, conditionals, and getc() calls), we call
the function once per buffer (64K) and then parse the whole buffer.
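
To illustrate the pattern, here is a minimal sketch only, not the patch
code: a plain FILE* stands in for COPY's various input sources, and the
names copy_get_data_sketch/parse_loop_sketch are made up.

#include <stdio.h>

#define INPUT_BUF_SIZE (64 * 1024)

/* Returns the number of bytes actually read (0 at EOF). */
static int
copy_get_data_sketch(FILE *src, char *buf, int maxbytes)
{
    /* one read call per block instead of one getc() per byte */
    return (int) fread(buf, 1, (size_t) maxbytes, src);
}

static void
parse_loop_sketch(FILE *src)
{
    static char input_buf[INPUT_BUF_SIZE];
    int         nbytes;

    while ((nbytes = copy_get_data_sketch(src, input_buf, INPUT_BUF_SIZE)) > 0)
    {
        /* parse input_buf[0 .. nbytes-1] line by line here */
    }
}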

CopyReadLineBuffered() is a modified version of CopyReadLine() and is used
by CopyFromDelimited() to load a line at a time into the line buffer
line_bytebuf. It is very similar to CopyReadLine(), but:

Why is it faster: besides the reason I explained above (no char-at-a-time
reads), this function copies each line of data in its entirety to the line
buffer, instead of a char at a time with appendStringInfoCharMacro. The
gain here is very large. Also, instead of checking every character against
'\n', '\r', '\\', '.' and so on, we only look for the first character of
the line-end sequence, and when we find it we look backwards or forwards
to see whether it is escaped, etc.
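
Roughly like this (a hypothetical sketch; find_unescaped_newline is a
made-up name, and the real code also has to handle '\r' and '\r\n' line
ends):

#include <string.h>

/*
 * Find the first unescaped '\n' in buf.  Instead of testing every byte
 * against '\n', '\r', '\\' and '.', scan with memchr() for the line-end
 * character only, then look backwards to see if it is escaped.  Returns
 * the line length, or -1 if no complete line is in the buffer yet.
 */
static int
find_unescaped_newline(const char *buf, int len, char escapec)
{
    const char *p = buf;
    const char *end = buf + len;

    while (p < end)
    {
        const char *nl = memchr(p, '\n', (size_t) (end - p));
        const char *q;
        int         nescapes = 0;

        if (nl == NULL)
            return -1;          /* need more data */

        /* an odd number of preceding escape chars escapes the newline */
        for (q = nl; q > buf && *(q - 1) == escapec; q--)
            nescapes++;

        if (nescapes % 2 == 0)
            return (int) (nl - buf);    /* true end of line found */

        p = nl + 1;             /* newline was escaped; keep scanning */
    }
    return -1;
}

The complete line can then go into line_bytebuf with one memcpy() instead
of per-character appends.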

CopyReadAllAttrs() is the equivalent of CopyReadAttribute(), except that it
reads ALL the attributes of the line at once. It stores them in a byte
buffer called attr_bytebuf, and the offsets to the beginnings of the
attributes in the attr_offsets array. There is a pretty detailed
description in the function header.

Why is it faster: here lies probably the biggest improvement. For one
thing, all the work is focused (all attributes are parsed at once), which
means fewer function calls and conditionals. Further, the null-marker size
is determined ahead of time, saving repeated calls to strlen() (you'd be
surprised how slow that can make things on some occasions). Also, again,
the attribute data is loaded in big chunks into the attribute buffer (one
chunk if there are no escapes in the data) instead of a character at a
time. This is another major speedup. So, after one function parses all the
attributes, FunctionCall3 finds each one by pointing into the attribute
buffer using the attribute offsets in attr_offsets.
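
A much simplified sketch of that layout (hypothetical code: fixed-size
buffers, and no escape or null-marker handling):

#include <string.h>

#define MAX_ATTRS 1024

static char attr_bytebuf[65536];        /* attribute data, NUL-separated */
static int  attr_offsets[MAX_ATTRS];    /* start offset of each attribute */

/* Split one line into attributes; returns the attribute count. */
static int
read_all_attrs_sketch(const char *line, int len, char delimc)
{
    const char *p = line;
    const char *end = line + len;
    int         nattrs = 0;
    int         out = 0;

    for (;;)
    {
        const char *d = memchr(p, delimc, (size_t) (end - p));
        int         alen = (int) ((d ? d : end) - p);

        attr_offsets[nattrs++] = out;
        memcpy(attr_bytebuf + out, p, (size_t) alen);  /* one chunk, not per char */
        attr_bytebuf[out + alen] = '\0';   /* NUL-terminate for the input function */
        out += alen + 1;

        if (d == NULL || nattrs >= MAX_ATTRS)
            break;
        p = d + 1;
    }
    return nattrs;
}

Attribute i is then attr_bytebuf + attr_offsets[i], which is what gets
passed to the datatype input function via FunctionCall3.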

There is a function (bottom of the file) I wrote that is like the C
library's strchr(), but it does not stop scanning when the string has
embedded NUL bytes in it; it stops only at the end-of-line character or the
end of the input buffer. It's pretty simple and fast, but there may be a
way to optimize it further.
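
Something along these lines (a sketch; the name and signature are guesses,
not the patch's actual function):

/*
 * Like strchr(), but embedded '\0' bytes do not stop the scan: look for
 * c until it is found or the end of the buffer is reached.  Sets *len to
 * the number of bytes skipped; returns NULL if c was not found.
 */
static char *
buf_strchrlen_sketch(char *buf, int buflen, char c, int *len)
{
    char *p = buf;
    char *end = buf + buflen;

    while (p < end)
    {
        if (*p == c)
        {
            *len = (int) (p - buf);     /* bytes skipped before the match */
            return p;
        }
        p++;
    }
    *len = buflen;                      /* scanned the whole buffer */
    return NULL;
}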

WRT encodings: conversion to the server encoding still happens for every
line if necessary. However, checking mblen only needs to occur if the
client encoding is one of the 5 encodings that are not supported as server
encodings, not in every case where the two encodings differ (although I
don't think the old check really harmed performance, avoiding the extra
per-character conditionals probably helps a bit). In any case, the
strchrlen function skips multibyte characters appropriately as well.
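
For reference, that test can be done once per COPY rather than per
character; a sketch using PostgreSQL's pg_wchar.h
(PG_ENCODING_IS_CLIENT_ONLY and pg_encoding_mblen are existing names there;
advance_char is a made-up helper, and I'm assuming the 5 encodings are
SJIS, BIG5, GBK, UHC and GB18030):

#include "postgres.h"
#include "mb/pg_wchar.h"

/*
 * The per-character mblen check is only needed when the client encoding
 * is client-only: in those encodings the trailing bytes of a multibyte
 * character can look like ASCII.  Compute the condition once per COPY:
 *
 *     bool check_mblen = PG_ENCODING_IS_CLIENT_ONLY(client_encoding);
 */
static const char *
advance_char(const char *p, int client_encoding, bool check_mblen)
{
    if (check_mblen)
        return p + pg_encoding_mblen(client_encoding, p);  /* skip whole mb char */
    return p + 1;                                          /* byte at a time is safe */
}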

If I remember more relevant things I'll post them here.

Hope that helps; please ask if anything is still unclear!

Alon.

>
> Maybe I have missed that in previous discussion, but this change is
> sufficiently invasive that I think you owe that to the reviewers.
>
> cheers
>
> andrew
