Re: Parallel copy

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel copy
Date: 2020-02-14 10:05:48
Message-ID: CA+hUKGK3Wsu0Rtofad21YM4wVcjR35pV1s1fPpUYOEBDGCXySA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Feb 14, 2020 at 9:12 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> This work is to parallelize the COPY command, in particular the "COPY
> <table_name> FROM 'filename' WHERE <condition>;" command.

Nice project, and a great stepping stone towards parallel DML.

> The first idea is that we allocate each chunk to a worker and once the
> worker has finished processing the current chunk, it can start with
> the next unprocessed chunk. Here, we need to see how to handle the
> partial tuples at the end or beginning of each chunk. We can read the
> chunks into dsa/dsm instead of into a local buffer for processing.
> Alternatively, if we think that accessing shared memory can be costly,
> we can read the entire chunk into local memory, but copy the partial
> tuple at the beginning of a chunk (if any) to a dsa. We mainly need
> the partial tuple in the shared memory area. The worker which has found
> the initial part of the partial tuple will be responsible for processing
> the entire tuple. Now, to detect whether there is a partial tuple at
> the beginning of a chunk, we always start reading one byte prior to
> the start of the current chunk, and if that byte is not a terminating
> line byte, we know that it is a partial tuple. Now, while processing
> the chunk, we will ignore this first line and start after the first
> terminating line.

That's quite similar to the approach I took with a parallel file_fdw
patch[1], which mostly consisted of parallelising the reading part of
copy.c, except that...

> To connect the partial tuple in two consecutive chunks, we need to
> have another data structure (for the ease of reference in this email,
> I call it CTM (chunk-tuple-map)) in shared memory where we store some
> per-chunk information like the chunk-number, the dsa location of that
> chunk, and a variable which indicates whether we can free/reuse the
> current entry. Whenever we encounter a partial tuple at the
> beginning of a chunk, we note down its chunk number and dsa location
> in the CTM. Next, whenever we encounter any partial tuple at the end
> of a chunk, we search the CTM for the next chunk-number and read from
> the corresponding dsa location till we encounter a terminating line
> byte. Once we have read and processed this partial tuple, we can mark
> the entry as available for reuse. There are some loose ends here, like
> how many entries we shall allocate in this data structure. It depends
> on whether we want to allow a worker to start reading the next chunk
> before the partial tuple of the previous chunk is processed. To keep
> it simple, we can allow a worker to process the next chunk only when
> the partial tuple in the previous chunk is processed. This will allow
> us to keep the number of entries in the CTM equal to the number of
> workers. I think we can easily improve this if we want, but I don't
> think it will matter too much, as in most cases, by the time we have
> processed the tuples in a chunk, the partial tuple would have been
> consumed by the other worker.

... I didn't use a shm 'partial tuple' exchanging mechanism; I just
had each worker follow the final tuple in its chunk into the next
chunk, and had each worker ignore the first tuple in each chunk after
chunk 0 because it knows someone else is looking after that. That
means there was some double reading going on near the boundaries,
and considering how much I've been complaining about bogus extra
system calls on this mailing list lately, yeah, your idea of doing a
bit more coordination is better. If you go this way, you might at
least find the copy.c part of the patch I wrote useful as stand-in
scaffolding code while you prototype the parallel writing side, if
you don't already have something better for this?

> Another approach that came up during an offlist discussion with Robert
> is that we have one dedicated worker for reading the chunks from the
> file, and it copies the complete tuples of one chunk into shared
> memory; once that is done, it hands over that chunk to another worker
> which can process the tuples in that area. We can imagine that the
> reader worker is responsible for forming some sort of work queue that
> can be processed by the other workers. In this idea, we won't be able
> to get the benefit of initial tokenization (forming tuple boundaries)
> via parallel workers and might need some additional memory processing,
> as after the reader worker has handed over the initial shared memory
> segment, we need to somehow identify tuple boundaries and then process
> them.

Yeah, I have also wondered about something like this in a slightly
different context. For parallel query in general, I wondered if there
should be a Parallel Scatter node that can be put on top of any
parallel-safe plan and runs it in a worker process that just pushes
tuples into a single-producer multi-consumer shm queue, from which the
other workers read whenever they need a tuple. Hmm, but for COPY, I
suppose you'd want to push the raw lines with minimal examination, not
tuples, into a shm queue, so I guess that's a bit different.
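
For what it's worth, here is a toy model of the kind of
single-producer multi-consumer handoff I mean, with C11 atomics and a
plain array standing in for DSM and worker processes; everything here
is invented for illustration, and it glosses over backpressure (the
sketch pretends the queue never fills up), variable-length lines and
worker lifetimes:

/*
 * Toy single-producer multi-consumer queue: the reader publishes the
 * offsets of raw lines, and consumers race to claim them with an
 * atomic counter.  All names are hypothetical.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_SIZE 1024

typedef struct LineQueue
{
    size_t          lines[QUEUE_SIZE];  /* offsets of raw lines in a shared buffer */
    atomic_size_t   produced;           /* slots filled by the reader */
    atomic_size_t   claimed;            /* slots claimed by consumers */
} LineQueue;

/* Reader side: publish one line offset (assumes the queue isn't full). */
static void
queue_push(LineQueue *q, size_t line_offset)
{
    size_t      slot = atomic_load(&q->produced);

    q->lines[slot % QUEUE_SIZE] = line_offset;
    /* Release: consumers must see the slot's contents before the count. */
    atomic_store_explicit(&q->produced, slot + 1, memory_order_release);
}

/* Consumer side: claim the next unprocessed line, or return false. */
static bool
queue_pop(LineQueue *q, size_t *line_offset)
{
    for (;;)
    {
        size_t      claimed = atomic_load(&q->claimed);

        if (claimed >= atomic_load_explicit(&q->produced,
                                            memory_order_acquire))
            return false;       /* nothing available right now */

        /* Race the other consumers to claim this slot. */
        if (atomic_compare_exchange_weak(&q->claimed, &claimed,
                                         claimed + 1))
        {
            *line_offset = q->lines[claimed % QUEUE_SIZE];
            return true;
        }
    }
}

A tuple-based Scatter node would look much the same, except that the
queue would carry tuples rather than raw-line offsets.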

> Another thing we need to figure out is how many workers to use for
> the copy command. I think we can decide it based on the file size,
> which needs some experiments, or maybe based on user input.

It seems like we don't even really have a general model for that sort
of thing in the rest of the system yet, and I guess some kind of
fairly dumb explicit system would make sense in the early days...
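
Concretely, a fairly dumb version might be an explicit option that
wins when given, with a size-based fallback; the constant and the
names below are invented, and as you say the real numbers would need
experiments:

/*
 * Hypothetical heuristic: one worker per so-many bytes of input,
 * clamped by the usual worker limits, unless the user asked for a
 * specific number.
 */
#include <stdint.h>

#define BYTES_PER_COPY_WORKER   ((int64_t) 64 * 1024 * 1024)    /* made up */

static int
choose_copy_workers(int64_t file_size, int requested_workers, int max_workers)
{
    int         nworkers;

    /* An explicit user request wins, within the allowed maximum. */
    if (requested_workers > 0)
        return (requested_workers < max_workers) ? requested_workers : max_workers;

    /* Otherwise scale with file size, using at least one worker. */
    nworkers = (int) (file_size / BYTES_PER_COPY_WORKER) + 1;
    if (nworkers > max_workers)
        nworkers = max_workers;
    return nworkers;
}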

> Thoughts?

This is cool.

[1] https://www.postgresql.org/message-id/CA%2BhUKGKZu8fpZo0W%3DPOmQEN46kXhLedzqqAnt5iJZy7tD0x6sw%40mail.gmail.com
