Re: Parallel copy

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Ants Aasma <ants(at)cybertec(dot)at>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Alastair Turner <minion(at)decodable(dot)me>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel copy
Date: 2020-02-19 10:38:45
Message-ID: 20200219103845.7rwdqe43z327sp3z@development
Lists: pgsql-hackers

On Wed, Feb 19, 2020 at 11:02:15AM +0200, Ants Aasma wrote:
>On Wed, 19 Feb 2020 at 06:22, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>>
>> On Tue, Feb 18, 2020 at 8:08 PM Ants Aasma <ants(at)cybertec(dot)at> wrote:
>> >
>> > On Tue, 18 Feb 2020 at 15:21, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>> > >
>> > > On Tue, Feb 18, 2020 at 5:59 PM Ants Aasma <ants(at)cybertec(dot)at> wrote:
>> > > >
>> > > > On Tue, 18 Feb 2020 at 12:20, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>> > > > > This is something similar to what I had also in mind for this idea. I
>> > > > > had thought of handing over complete chunk (64K or whatever we
>> > > > > decide). The one thing that slightly bothers me is that we will add
>> > > > > some additional overhead of copying to and from shared memory which
>> > > > > was earlier from local process memory. And, the tokenization (finding
>> > > > > line boundaries) would be serial. I think that tokenization should be
>> > > > > a small part of the overall work we do during the copy operation, but
>> > > > > will do some measurements to ascertain the same.
>> > > >
>> > > > I don't think any extra copying is needed.
>> > > >
>> > >
>> > > I am talking about access to shared memory instead of the process
>> > > local memory. I understand that an extra copy won't be required.
>> > >
>> > > > The reader can directly
>> > > > fread()/pq_copymsgbytes() into shared memory, and the workers can run
>> > > > CopyReadLineText() inner loop directly off of the buffer in shared memory.
>> > > >
>> > >
>> > > I am slightly confused here. AFAIU, the for(;;) loop in
>> > > CopyReadLineText is about finding the line endings which we thought
>> > > that the reader process will do.
>> >
>> > Indeed, I somehow misread the code while scanning over it. So CopyReadLineText
>> > currently copies data from cstate->raw_buf to the StringInfo in
>> > cstate->line_buf. In parallel mode it would copy it from the shared data buffer
>> > to local line_buf until it hits the line end found by the data reader. The
>> > amount of copying done is still exactly the same as it is now.
>> >
>>
>> Yeah, on a broader level it will be something like that, but actual
>> details might vary during implementation. BTW, have you given any
>> thoughts on one other approach I have shared above [1]? We might not
>> go with that idea, but it is better to discuss different ideas and
>> evaluate their pros and cons.
>>
>> [1] - https://www.postgresql.org/message-id/CAA4eK1LyAyPCtBk4rkwomeT6%3DyTse5qWws-7i9EFwnUFZhvu5w%40mail.gmail.com
>
>It seems to be that at least for the general CSV case the tokenization to
>tuples is an inherently serial task. Adding thread synchronization to that path
>for coordinating between multiple workers is only going to make it slower. It
>may be possible to enforce limitations on the input (e.g. no quotes allowed) or
>do some speculative tokenization (e.g. if we encounter quote before newline
>assume the chunk started in a quoted section) to make it possible to do the
>tokenization in parallel. But given that the simpler and more featured approach
>of handling it in a single reader process looks to be fast enough, I don't see
>the point. I rather think that the next big step would be to overlap reading
>input and tokenization, hopefully by utilizing Andres's work on asyncio.
>

I generally agree with the impression that parsing CSV is tricky and
unlikely to benefit from parallelism in general. There may be cases where
restrictions on the format make it easier, but that might be a bit too
complex to start with.

For example, I had an idea to parallelise the parsing by splitting it
into two phases:

1) indexing

Split the CSV file into equally-sized chunks and make each worker just
scan through its chunk, storing the positions of delimiters, quotes,
newlines etc. This is probably the most expensive part of the parsing
(essentially going char by char), and we'd speed it up linearly (a rough
sketch of both phases follows below).

2) merge

Combine the information from (1) in a single process, and actually parse
the CSV data - we would not have to inspect each character, because we'd
already know the positions of the interesting characters, so this should
be fast. We might have to recheck some of it (e.g. escaping) but it
should still be much faster.
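
To make that a bit more concrete, here's a very rough sketch of what I
have in mind (all the names - TokenPos, index_chunk, merge_tokens - are
made up, this is not actual copy.c code, and it ignores multi-byte
encodings, escaped quotes and chunks starting inside a quoted value):

#include <stdbool.h>
#include <stddef.h>

typedef enum { TOK_DELIM, TOK_QUOTE, TOK_NEWLINE } TokenKind;

typedef struct
{
    TokenKind   kind;
    size_t      offset;     /* position within the whole input buffer */
} TokenPos;

/* phase 1: each worker records interesting characters in [start, end) */
static size_t
index_chunk(const char *buf, size_t start, size_t end,
            char delim, char quote, TokenPos *out)
{
    size_t  ntokens = 0;

    for (size_t i = start; i < end; i++)
    {
        char    c = buf[i];

        if (c == delim || c == quote || c == '\n')
        {
            out[ntokens].kind = (c == delim) ? TOK_DELIM :
                                (c == quote) ? TOK_QUOTE : TOK_NEWLINE;
            out[ntokens].offset = i;
            ntokens++;
        }
    }
    return ntokens;
}

/*
 * phase 2: a single process walks the merged token stream in order,
 * tracks quoting state and keeps only the newlines that are real row
 * boundaries; escaped quotes etc. would need the recheck mentioned above
 */
static size_t
merge_tokens(const TokenPos *tokens, size_t ntokens, size_t *line_ends)
{
    bool    in_quotes = false;
    size_t  nlines = 0;

    for (size_t i = 0; i < ntokens; i++)
    {
        if (tokens[i].kind == TOK_QUOTE)
            in_quotes = !in_quotes;
        else if (tokens[i].kind == TOK_NEWLINE && !in_quotes)
            line_ends[nlines++] = tokens[i].offset;
        /* TOK_DELIM: column boundaries would be collected the same way */
    }
    return nlines;
}

The merge step assumes the per-chunk position arrays are concatenated in
input order; a chunk that starts in the middle of a quoted value is
exactly the case that would need the rechecking mentioned above.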

But yes, this may be a bit complex and I'm not sure it's worth it.

The one piece of information I'm missing here is at least a very rough
quantification of the individual steps of CSV processing - for example
if parsing takes only 10% of the time, it's pretty pointless to start by
parallelising this part and we should focus on the rest. If it's 50% it
might be a different story. Has anyone done any measurements?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
