Re: [PATCH] Performance Improvement For Copy From Binary Files

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Amit Langote <amitlangote09(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Performance Improvement For Copy From Binary Files
Date: 2020-07-09 10:33:42
Message-ID: CALj2ACVe_qKV6ky5bJCtMLbLdU07MPcPvk9iaMfdh3eKHRjWwg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

>
> Considering these points, I came up with the attached patch with a
> much smaller footprint. As far as I can see, it implements the same
> basic idea as your patch. With it, I was able to see an improvement
> in loading time consistent with your numbers. I measured the time of
> loading 10 million rows into tables with 5, 10, 20 text columns as
> follows:
>

Thanks Amit for buying into the idea. I agree that your patch looks
clean and simple compared to mine and I'm okay with your patch.

I reviewed and tested your patch, below are few comments:

I think we can remove(and delete the function from the code) the
CopyGetInt16() and have the code directly to save the function call
cost. It gets called for each attribute/column for each row/tuple to
just call CopyReadFromRawBuf() and set the byte order. From a
readability perspective it's okay to have this function, but cost wise
I feel no need for that function at all. In one of our earlier
work(parallel copy), we observed that having a new function or few
extra statements in this copy from path which gets hit for each row,
incurs noticeable execution cost.

The same way, we can also avoid using CopyGetInt32() function call in
CopyReadBinaryAttribute() for the same reason stated above.

In CopyReadFromRawBuf(), can the "saveTo" parameter be named "dest"
and use that with (char *) typecast directly, instead of having a
local variable? Though it may/may not be a standard practice, let's
have the parameter name all lower case to keep it consistent with
other function's parameters in the copy.c file.

Seems like there's a bug in below part of the code. Use case is
simple, have some junk value at the end of the binary file, then with
your patch the query succeeds, but it should report the below error.
Here, on fld_count == -1 instead of reading from file, we must be
reading it from the buffer, as we would have already read all the data
from the file into the buffer.
if (cstate->copy_dest != COPY_OLD_FE &&
CopyGetData(cstate, &dummy, 1, 1) > 0)
ereport(ERROR,
(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
errmsg("received copy data after EOF marker")));

I also tried with some intentionally corrupted binary datasets, (apart
from above issue) patch works fine.

For the case where required nbytes may not fit into the buffer in
CopyReadFromRawBuf, I'm sure this can happen only for field data,
(field count , and field size are of fixed length and can fit in the
buffer), instead of reading them in parts of buff size into the buffer
(using CopyLoadRawBuf) and then DRAIN_COPY_RAW_BUF() to the
destination, I think we can detect this condition using requested
bytes and the buffer size and directly read from the file to the
destination buffer and then reload the raw_buffer for further
processing. I think this way, it will be good.

I have few synthesized test cases where fields can be of larger size.
I executed them on your patch, but didn't debug to see whether
actually we hit the code where required nbytes can't fit in the entire
buffer. I will try this on the next version of the patch.

>
> HEAD patched
> foo5 8.5 6.5
> foo10 14 10
> foo20 25 18
>

Numbers might improve a bit, if we remove the extra function calls as
stated above.

Overall, thanks for your suggestions in the previous mail, my patch
was prepared in a bit hurried manner, anyways, will take care next
time.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2020-07-09 10:42:49 Re: Physical replication slot advance is not persistent
Previous Message Bharath Rupireddy 2020-07-09 09:42:57 Re: Parallel worker hangs while handling errors.