Re: Parallel copy

From: vignesh C <vignesh21(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Hou, Zhijie" <houzj(dot)fnst(at)cn(dot)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Ants Aasma <ants(at)cybertec(dot)at>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alastair Turner <minion(at)decodable(dot)me>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel copy
Date: 2020-10-27 15:26:41
Message-ID: CALDaNm074KxPB7PbCsE1PkaQPoJUeMtXyT5w0SPszUHmPjPxyQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks Heikki for reviewing and providing your comments. Please find
my thoughts below.

On Fri, Oct 23, 2020 at 2:01 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
> I had a brief look at at this patch. Important work! A couple of first
> impressions:
>
> 1. The split between patches
> 0002-Framework-for-leader-worker-in-parallel-copy.patch and
> 0003-Allow-copy-from-command-to-process-data-from-file.patch is quite
> artificial. All the stuff introduced in the first is unused until the
> second patch is applied. The first patch introduces a forward
> declaration for ParallelCopyData(), but the function only comes in the
> second patch. The comments in the first patch talk about
> LINE_LEADER_POPULATING and LINE_LEADER_POPULATED, but the enum only
> comes in the second patch. I think these have to merged into one. If you
> want to split it somehow, I'd suggest having a separate patch just to
> move CopyStateData from copy.c to copy.h. The subsequent patch would
> then be easier to read as you could see more easily what's being added
> to CopyStateData. Actually I think it would be better to have a new
> header file, copy_internal.h, to hold CopyStateData and the other
> structs, and keep copy.h as it is.
>

I have merged 0002 & 0003 patch, I have moved few things like creation
of copy_internal.h, moving of CopyStateData from copy.c into
copy_internal.h into 0001 patch.

> 2. This desperately needs some kind of a high-level overview of how it
> works. What is a leader, what is a worker? Which process does each step
> of COPY processing, like reading from the file/socket, splitting the
> input into lines, handling escapes, calling input functions, and
> updating the heap and indexes? What data structures are used for the
> communication? How does is the work synchronized between the processes?
> There are comments on those individual aspects scattered in the patch,
> but if you're not already familiar with it, you don't know where to
> start. There's some of that in the commit message, but it needs to be
> somewhere in the source code, maybe in a long comment at the top of
> copyparallel.c.
>

Added it in copyparallel.c

> 3. I'm surprised there's a separate ParallelCopyLineBoundary struct for
> every input line. Doesn't that incur a lot of synchronization overhead?
> I haven't done any testing, this is just my gut feeling, but I assumed
> you'd work in batches of, say, 100 or 1000 lines each.
>

Data read from the file will be stored in DSM which is of size 64k *
1024. Leader will parse and identify the line boundary like which line
starts from which data block, what is the starting offset in the data
block, what is the line size, this information will be present in
ParallelCopyLineBoundary. Like you said, each worker processes
WORKER_CHUNK_COUNT 64 lines at a time. Performance test results run
for parallel copy are available at [1]. This is addressed in v9 patch
shared at [2].

[1] https://www.postgresql.org/message-id/CALj2ACWeQVd-xoQZHGT01_33St4xPoZQibWz46o7jW1PE3XOqQ%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CALDaNm1cAONkFDN6K72DSiRpgqNGvwxQL7TjEiHZ58opnp9VoA@mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message vignesh C 2020-10-27 15:33:22 Re: Parallel copy
Previous Message vignesh C 2020-10-27 15:23:39 Re: Parallel copy