Re: COPY FROM performance improvements

From: "Alon Goldshuv" <agoldshuv(at)greenplum(dot)com>
To: "ALON GOLDSHUV" <agoldshuv(at)greenplum(dot)com>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>
Cc: llonergan(at)greenplum(dot)com, pgsql-patches(at)postgresql(dot)org
Subject: Re: COPY FROM performance improvements
Date: 2005-06-25 23:04:06
Message-ID: BEE330F6.6171%agoldshuv@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-patches

Just remembered another thing.

A struct "bytebuffer" is used instead of a StringInfoData for storing the
line and attributes. A StringInfoData is actually really cool and useful,
but we don't really need it's formatting capabilities in COPY FROM (as far
as I know), and so the bytebuffer is more straightfoward and faster.

Alon.

On 6/25/05 3:52 PM, "Alon Goldshuv" <agoldshuv(at)greenplum(dot)com> wrote:

>
>
>> I haven't seen any demand for users to be allowed to specify an escape char
>> in text mode (BTW, that's what the docs call what you have called delimited
>> mode). Maybe there's a case for it, but I somewhat doubt it.
>
> I ran into users that could not load their data using COPY because their
> data included a lot of embedded '\\'... These users would like to chose
> another char to be the escape or maybe even chose to disable escaping
> alltogher. However, this is probably a separate discussion that we should
> think about in the future and see if it's worth introducing.
>
>> As for variable names, choose some variant that seems reasonable. If we need
>> to distinguish we should use csv_escapec and text_escapec. But I'm not sure
>> we can't use the same variable name for both cases, unless they would have
>> conflicting scopes.
> Cool. I don't think the scopes are conflicting at the moment.
>
>> What I would like to have is a high level description of
>> . how the new text mode code differs from the old text mode code, and
>> . which part of the change is responsible for how much performance gain.
>
> Fair enough!
>
> Where the code differs and why it's faster:
> -------------------------------------------
>
> The changes in the code only touch the *parsing* parts for *text* format.
> All the rest of the COPY process remains the same. More specifically:
>
> CopyFrom() still handles BINARY and CSV, and I extracted TEXT mode to
> another function called CopyFromDelimited() (refering to delimited text
> format). This function calls the rest of the modified functions below:
>
> CopyGetData() was modified to return the number of bytes actually read
> successfully. Rather than asking for a char at a time, the text parsing code
> asks for big blocks of data at a time, and needs to know how much data was
> accepted into the input buffer (input_buf) for parsing. CopyFrom still can
> call CopyGetData and not care about the return value, like it did before.
>
> Why is it faster: Instead of a calling the CopyGetData for every single byte
> (involves huge amount of function calls, conditionals, and getc() ) we call
> this function once for every buffer (64K) and then start parsing the buffer.
>
> CopyReadLineBuffered is a modified version of CopyReadLine and is used by
> CopyFromDelimited to load a line at a time into the line buffer
> line_bytebuf. Very similar to CopyReadLine, but:
>
> Why is it faster: beside to reason I explained above (non char-at-a-time
> read) this function copies the line of data in it's entirety to the line
> buffer, instead of a char-at-a-time with AppendStringInfoMacro. The gain
> here is very large. Also, instead of checking for if '\n', if '\r', if '\\',
> if '.'... For every character, we only look for the first character of the
> line end sequence and when finding it we look backwards or forwards to see
> if it's escaped etc..
>
> CopyReadAllAttrs() is the equivalent of CopyReadAttribute(). However it
> reads ALL the attributes of the line at once. It stores all of them in a
> byte buffer called attr_bytebuf and the offsets to beginnings of the
> attributes in attr_offsets array. There is a pretty detailed description in
> the function header.
>
> Why is it faster: Here lies probably the biggest improvement. For one thing
> all the work is focused (all attributes parsed at once) - less function
> calls and conditions. Further, the null mark side is determined ahead of
> time to save from multiple calls to strlen (you'll be surprised how slow it
> can make things in some occasions). Also, again, the attribute data is
> loaded big chunks to the attribute buffer (1 chunk if there are no escapes
> in the data) instead of a character at a time. This is another major
> speedup. So, after parsing all attributes in one function, functioncall3
> will find them by pointing to the attribute array using the attribute
> location pointers in attr_offsets.
>
> There is a function (bottom of the file) I wrote that is like the c library
> strchr() but instead allows you to not stop scanning when the string has
> embedded nulls in it, but stop only at end-of-line characters or end of
> input buffer. It's pretty simple and fast, but there may be a way to
> optimize it further.
>
> WRT encodings. Conversion to server encoding still happens for every line if
> necessary. However checking for mblen only needs to occur if our client
> encoding is one of the 5 non-server-supported encodings, not for every case
> that the encodings are different (although I don't think it previously
> harmed the performance, it probably helped a bit by avoiding extra per
> character conditionals). In any case, the strchrlen function skips mb chars
> appropriately as well.
>
> If I remember more relevant things I'll post them here.
>
> Hope that helps, please ask more if there are unclear things!
>
> Alon.
>
>
>
>
>
>>
>> Maybe I have missed that in previous discussion, but this change is
>> sufficiently invasive that I think you owe that to the reviewers.
>>
>> cheers
>>
>> andrew
>>
>> cheers
>>
>> andrew
>>
>>
>>
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
> choose an index scan if your joining column's datatypes do not
> match
>

In response to

Responses

Browse pgsql-patches by date

  From Date Subject
Next Message Tom Lane 2005-06-25 23:09:46 Re: COPY FROM performance improvements
Previous Message Luke Lonergan 2005-06-25 22:56:43 Re: COPY FROM performance improvements