Re: [PATCH] Performance Improvement For Copy From Binary Files

From: Amit Langote <amitlangote09(at)gmail(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Performance Improvement For Copy From Binary Files
Date: 2020-07-10 03:21:12
Message-ID: CA+HiwqFF5My6hrrMEEynm+HEhXf5+EySnEcvOxVX0Fmjw13C5w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Bharath,

On Thu, Jul 9, 2020 at 7:33 PM Bharath Rupireddy
<bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
> Thanks Amit for buying into the idea. I agree that your patch looks
> clean and simple compared to mine and I'm okay with your patch.
>
> I reviewed and tested your patch, below are few comments:

Thanks for checking it out.

> I think we can remove(and delete the function from the code) the
> CopyGetInt16() and have the code directly to save the function call
> cost. It gets called for each attribute/column for each row/tuple to
> just call CopyReadFromRawBuf() and set the byte order. From a
> readability perspective it's okay to have this function, but cost wise
> I feel no need for that function at all. In one of our earlier
> work(parallel copy), we observed that having a new function or few
> extra statements in this copy from path which gets hit for each row,
> incurs noticeable execution cost.
>
> The same way, we can also avoid using CopyGetInt32() function call in
> CopyReadBinaryAttribute() for the same reason stated above.

I agree that removing the function call overhead in this case is worth
the slight loss of readability.

> In CopyReadFromRawBuf(), can the "saveTo" parameter be named "dest"
> and use that with (char *) typecast directly, instead of having a
> local variable? Though it may/may not be a standard practice, let's
> have the parameter name all lower case to keep it consistent with
> other function's parameters in the copy.c file.

Agreed.

> Seems like there's a bug in below part of the code. Use case is
> simple, have some junk value at the end of the binary file, then with
> your patch the query succeeds, but it should report the below error.
> Here, on fld_count == -1 instead of reading from file, we must be
> reading it from the buffer, as we would have already read all the data
> from the file into the buffer.
> if (cstate->copy_dest != COPY_OLD_FE &&
> CopyGetData(cstate, &dummy, 1, 1) > 0)
> ereport(ERROR,
> (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
> errmsg("received copy data after EOF marker")));
>
> I also tried with some intentionally corrupted binary datasets, (apart
> from above issue) patch works fine.

Yeah, I see the bug. I should've checked all the call sites of
CopyGetData() and made sure there is only one left, that is,
CopyLoadRawBuffer().

> For the case where required nbytes may not fit into the buffer in
> CopyReadFromRawBuf, I'm sure this can happen only for field data,
> (field count , and field size are of fixed length and can fit in the
> buffer), instead of reading them in parts of buff size into the buffer
> (using CopyLoadRawBuf) and then DRAIN_COPY_RAW_BUF() to the
> destination, I think we can detect this condition using requested
> bytes and the buffer size and directly read from the file to the
> destination buffer and then reload the raw_buffer for further
> processing. I think this way, it will be good.

Hmm, I'm afraid that this will make the code more complex for
apparently small benefit. Is this really that much of a problem
performance wise?

> I have few synthesized test cases where fields can be of larger size.
> I executed them on your patch, but didn't debug to see whether
> actually we hit the code where required nbytes can't fit in the entire
> buffer. I will try this on the next version of the patch.
>
> >
> > HEAD patched
> > foo5 8.5 6.5
> > foo10 14 10
> > foo20 25 18
> >
>
> Numbers might improve a bit, if we remove the extra function calls as
> stated above.

Here the numbers with the updated patch:

HEAD patched (v2)
foo5 8.5 6.1
foo10 14 9.4
foo20 25 16.7

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
CopyFrom-binary-buffering_v2.patch application/octet-stream 6.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ajin Cherian 2020-07-10 03:51:08 Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Previous Message Masahiko Sawada 2020-07-10 03:16:11 Re: Global snapshots