Re: [PATCH] Performance Improvement For Copy From Binary Files

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: vignesh C <vignesh21(at)gmail(dot)com>, Amit Langote <amitlangote09(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Performance Improvement For Copy From Binary Files
Date: 2020-07-13 01:19:17
Message-ID: CALj2ACW7mTSMBO5Xmx=p4f9PnuJjF6_PEqjNno8hTewSMM-MeA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks Thomas for checking this feature.

> On Mon, Jul 13, 2020 at 4:06 AM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
>
> This error showed up when cfbot tried it:
>
> COPY BINARY stud_emp FROM
> '/home/travis/build/postgresql-cfbot/postgresql/src/test/regress/results/stud_emp.data';
> +ERROR: could not read from COPY file: Bad address

This is due to the recent commit
cd22d3cdb9bd9963c694c01a8c0232bbae3ddcfb, in which we restricted the
raw_buf and line_buf allocations for binary files. Since we are using
raw_buf for this performance improvement feature, now, it's enough to
restrict only line_buf for binary files. I made the changes
accordingly in the v3 patch attached here.

Regression tests(make check & make check-world) ran cleanly with the v3 patch.

Please also find my responses for:

Vignesh's comment:

> On Sun, Jul 12, 2020 at 6:43 PM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> I had reviewed the changes. I felt one minor change required:
> + * CopyReadFromRawBuf
> + * Reads 'nbytes' bytes from cstate->copy_file via
> cstate->raw_buf and
> + * writes then to 'saveTo'
> + *
> + * Useful when reading binary data from the file.
> Should "writes then to 'saveTo'" be "writes them to 'dest'"?
>

Thanks Vignesh for reviewing the patch. Modified 'saveTo' to 'dest' in v3 patch.

Amit's comment:

>
> > For the case where required nbytes may not fit into the buffer in
> > CopyReadFromRawBuf, I'm sure this can happen only for field data,
> > (field count , and field size are of fixed length and can fit in the
> > buffer), instead of reading them in parts of buff size into the buffer
> > (using CopyLoadRawBuf) and then DRAIN_COPY_RAW_BUF() to the
> > destination, I think we can detect this condition using requested
> > bytes and the buffer size and directly read from the file to the
> > destination buffer and then reload the raw_buffer for further
> > processing. I think this way, it will be good.
>
> Hmm, I'm afraid that this will make the code more complex for
> apparently small benefit. Is this really that much of a problem
> performance wise?
>

Yes it makes CopyReadFromRawBuf(), code a bit complex from a
readability perspective. I'm convinced not to have the
abovementioned(by me) change, due to 3 reasons,1) the
readability/understandability 2) how many use cases can we have where
requested field size greater than RAW_BUF_SIZE(64KB)? I think very few
cases. I may be wrong here. 3) Performance wise it may not be much as
we do one extra memcpy only in situations where field sizes are
greater than 64KB(as we have already seen and observed by you as well
in one of the response [1]) that memcpy cost for this case may be
negligible.

Considering all of above, I'm okay to have CopyReadFromRawBuf()
function, the way it is currently.

[1]
> >
> > One solution to this problem is to read data from binary file in RAW_BUF_SIZE(64KB) chunks to avoid repeatedly calling fread()(thus possibly avoiding few disk IOs). This is similar to the approach followed for csv/text files.
>
> I agree that having the buffer in front of the file makes sense,
> although we do now have an extra memcpy, that is, from raw_buf to
> attribute_buf.data. Currently, fread() reads directly into
> attribute_buf.data. But maybe that's okay as I don't see the new copy
> being all that bad.
>

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
CopyFrom-binary-buffering_v3.patch application/octet-stream 8.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro Horiguchi 2020-07-13 01:48:19 Re: Physical slot restart_lsn advances incorrectly requiring restore from archive
Previous Message Alvaro Herrera 2020-07-13 00:33:10 Re: WIP: BRIN multi-range indexes