Re: Parallel copy

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: vignesh C <vignesh21(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel copy
Date: 2020-11-02 06:14:25
Message-ID: CAA4eK1JxVGPOUs3JgtBqnFA5tYFzaUbjCN2_30CdwBiAJ-Ecmw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Oct 30, 2020 at 10:11 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
> Leader process:
>
> The leader process is simple. It picks the next FREE buffer, fills it
> with raw data from the file, and marks it as FILLED. If no buffers are
> FREE, wait.
>
> Worker process:
>
> 1. Claim next READY block from queue, by changing its state to
> PROCESSING. If the next block is not READY yet, wait until it is.
>
> 2. Start scanning the block from 'startpos', finding end-of-line
> markers. (in CSV mode, need to track when we're in-quotes).
>
> 3. When you reach the end of the block, if the last line continues to
> next block, wait for the next block to become FILLED. Peek into the
> next block, and copy the remaining part of the split line to a local
> buffer, and set the 'startpos' on the next block to point to the end
> of the split line. Mark the next block as READY.
>
> 4. Process all the lines in the block, call input functions, insert
> rows.
>
> 5. Mark the block as DONE.
>
> In this design, you don't need to keep line boundaries in shared memory,
> because each worker process is responsible for finding the line
> boundaries of its own block.
>
> There's a point of serialization here, in that the next block cannot be
> processed, until the worker working on the previous block has finished
> scanning the EOLs, and set the starting position on the next block,
> putting it in READY state. That's not very different from your patch,
> where you had a similar point of serialization because the leader
> scanned the EOLs,
>
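[For illustration, the state machine Heikki describes above might look roughly like the following single-threaded sketch. This is not code from any patch; the type and function names (CopyBlock, leader_fill, worker_scan) and the 64-byte stand-in buffer are mine, and real code would use shared memory, atomics, and condition-variable waits instead of simply returning.]

```c
#include <assert.h>
#include <string.h>

/* Hypothetical block states mirroring the proposed design. */
typedef enum { BLK_FREE, BLK_FILLED, BLK_READY, BLK_PROCESSING, BLK_DONE } BlockState;

typedef struct
{
    BlockState state;
    int     startpos;       /* where this block's first full line begins */
    char    data[64];       /* stand-in for a 64K raw-input buffer */
    int     len;
} CopyBlock;

/* Leader: fill the next FREE block with raw input, mark it FILLED. */
static int
leader_fill(CopyBlock *blk, const char *input, int len)
{
    if (blk->state != BLK_FREE)
        return 0;            /* the real leader would wait here */
    memcpy(blk->data, input, len);
    blk->len = len;
    blk->state = BLK_FILLED;
    return 1;
}

/*
 * Worker: claim a block, scan it for end-of-line markers, then set
 * 'startpos' on the next block past the tail of the split line and
 * mark that block READY (steps 1-3 and 5 above; step 4, calling input
 * functions and inserting rows, is elided).  Returns the offset of the
 * last newline found, or -1.
 */
static int
worker_scan(CopyBlock *blk, CopyBlock *next)
{
    int     i, last_eol = -1;

    /* The first block starts as FILLED; later blocks must be READY. */
    if (blk->state != BLK_READY && blk->state != BLK_FILLED)
        return -1;
    blk->state = BLK_PROCESSING;

    for (i = blk->startpos; i < blk->len; i++)
        if (blk->data[i] == '\n')
            last_eol = i;

    /*
     * The tail after the last newline continues into the next block;
     * the real design copies it to a local buffer first.  If the next
     * block held no newline at all, a real worker would keep going.
     */
    if (next && next->state == BLK_FILLED)
    {
        int     j;

        for (j = next->startpos; j < next->len; j++)
        {
            if (next->data[j] == '\n')
            {
                next->startpos = j + 1;
                break;
            }
        }
        next->state = BLK_READY;
    }
    blk->state = BLK_DONE;
    return last_eol;
}
```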

But in the design (single producer, multiple consumers) used by the
patch, a worker doesn't need to wait until the complete block has been
scanned; it can start processing the lines already found. This also
allows workers to start processing the data much earlier, as they don't
need to wait for all the offsets of a 64K block to be ready. In the
design where each worker processes its own 64K block, however, the
waits can be much longer. I think this will impact the COPY FROM STDIN
case more, where in most cases (200-300 byte tuples) we receive data
line by line from the client and the leader finds the line endings. If
the leader doesn't find the line endings, the workers need to wait
until the leader fills the entire 64K chunk; OTOH, with the current
approach a worker can start as soon as the leader has populated some
minimum number of line endings.
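[A rough sketch of what this single-producer arrangement implies: the leader publishes each line-ending offset into a shared ring as soon as it finds one, and a worker can consume as soon as anything is available. The names (LineRing, leader_publish, worker_take) are illustrative, not from the patch, and a real implementation would use shared memory and proper synchronization rather than a plain struct.]

```c
#include <assert.h>

#define RING_SIZE 8

typedef struct
{
    int     offsets[RING_SIZE]; /* end offset of each completed line */
    int     head;               /* next slot the leader writes */
    int     tail;               /* next slot a worker consumes */
} LineRing;

/*
 * Leader side: scan newly received bytes and publish each newline's
 * offset immediately, so workers need not wait for a full 64K chunk.
 * Returns the number of lines made available.
 */
static int
leader_publish(LineRing *r, const char *buf, int start, int end)
{
    int     i, n = 0;

    for (i = start; i < end; i++)
    {
        if (buf[i] == '\n' && (r->head + 1) % RING_SIZE != r->tail)
        {
            r->offsets[r->head] = i;
            r->head = (r->head + 1) % RING_SIZE;
            n++;
        }
    }
    return n;
}

/* Worker side: consume one line ending as soon as it is published. */
static int
worker_take(LineRing *r, int *off)
{
    if (r->tail == r->head)
        return 0;               /* nothing yet; a real worker would sleep */
    *off = r->offsets[r->tail];
    r->tail = (r->tail + 1) % RING_SIZE;
    return 1;
}
```

The contrast with the per-block design is that worker_take can succeed after the very first newline arrives from the client, rather than after an entire block is filled and handed over.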

The other point is that the leader backend won't be fully utilized, as
it is doing only a very small part (primarily reading the file) of the
overall work.

We have discussed both of these approaches, (a) single producer,
multiple consumers, and (b) all workers doing the processing as you
suggest, at the beginning of this work and concluded that (a) is
better; see some of the relevant emails [1][2][3].

[1] - https://www.postgresql.org/message-id/20200413201633.cki4nsptynq7blhg%40alap3.anarazel.de
[2] - https://www.postgresql.org/message-id/20200415181913.4gjqcnuzxfzbbzxa%40alap3.anarazel.de
[3] - https://www.postgresql.org/message-id/78C0107E-62F2-4F76-BFD8-34C73B716944%40anarazel.de

--
With Regards,
Amit Kapila.
