Re: Parallel copy

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Ants Aasma <ants(at)cybertec(dot)at>
Cc: Alastair Turner <minion(at)decodable(dot)me>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel copy
Date: 2020-02-18 10:20:40
Message-ID: CAA4eK1+f9QBLc8w5qr3cppcH2vd-frs5N6dP=P0Jv0p++u+Cyw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Feb 17, 2020 at 8:34 PM Ants Aasma <ants(at)cybertec(dot)at> wrote:
>
> On Sat, 15 Feb 2020 at 14:32, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > Good point and I agree with you that having a single process would
> > avoid any such stuff. However, I will think some more on it and if
> > you/anyone else gets some idea on how to deal with this in a
> > multi-worker system (where we can allow each worker to read and
> > process the chunk) then feel free to share your thoughts.
>
> I think having a single process handle splitting the input into tuples makes
> most sense. It's possible to parse csv at multiple GB/s rates [1], finding
> tuple boundaries is a subset of that task.
>
> My first thought for a design would be to have two shared memory ring buffers,
> one for data and one for tuple start positions. Reader process reads the CSV
> data into the main buffer, finds tuple start locations in there and writes
> those to the secondary buffer.
>
> Worker processes claim a chunk of tuple positions from the secondary buffer and
> update their "keep this data around" position with the first position. Then
> proceed to parse and insert the tuples, updating their position until they find
> the end of the last tuple in the chunk.
>

This is similar to what I had in mind for this idea. I had thought of
handing over a complete chunk (64K or whatever size we decide on). The
one thing that slightly bothers me is the additional overhead of
copying to and from shared memory, where earlier the data was read
from local process memory. Also, the tokenization (finding line
boundaries) would be serial. I think tokenization should be a small
part of the overall work we do during the copy operation, but I will
do some measurements to confirm that.
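
To make the two pieces under discussion concrete, here is a rough sketch
in C of (a) the serial step that finds tuple start positions and (b) a
worker claiming a chunk of those positions. This is not PostgreSQL code
and not a proposed patch: find_tuple_starts, TupleQueue and claim_chunk
are hypothetical names, the quote handling assumes the CSV escape
character is the double quote itself, and the ring-buffer wrap-around
and the "keep this data around" position are left out entirely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical sketch: scan CSV bytes and record the offset at which each
 * tuple starts.  A newline inside a double-quoted field does not end the
 * tuple.  Returns the number of offsets written (at most max_starts).
 */
size_t
find_tuple_starts(const char *data, size_t len,
                  size_t *starts, size_t max_starts)
{
    size_t      n = 0;
    bool        in_quote = false;

    assert(data != NULL && starts != NULL);

    if (len > 0 && n < max_starts)
        starts[n++] = 0;        /* the first tuple starts at offset 0 */

    for (size_t i = 0; i < len; i++)
    {
        if (data[i] == '"')
            in_quote = !in_quote;   /* a "" escape toggles twice: no net change */
        else if (data[i] == '\n' && !in_quote && i + 1 < len)
        {
            if (n == max_starts)
                break;
            starts[n++] = i + 1;    /* next tuple begins after the newline */
        }
    }
    return n;
}

/* Hypothetical stand-in for the secondary buffer of tuple start positions. */
typedef struct
{
    atomic_size_t next;         /* index of the next unclaimed tuple start */
    size_t        ntuples;      /* number of tuple starts published */
} TupleQueue;

/*
 * Claim up to 'want' tuples.  On success, *first is the index of the first
 * claimed tuple start and the return value is the number claimed (0 when
 * nothing is left).
 */
size_t
claim_chunk(TupleQueue *q, size_t want, size_t *first)
{
    size_t      begin = atomic_fetch_add(&q->next, want);
    size_t      avail;

    if (begin >= q->ntuples)
        return 0;
    *first = begin;
    avail = q->ntuples - begin;
    return avail < want ? avail : want;
}
```

The scan in find_tuple_starts is the serial part I am concerned about;
claim_chunk uses a single atomic fetch-and-add, so workers would not need
to take a lock just to pick up their next batch of tuples.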

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
