Re: Parallel copy

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: vignesh C <vignesh21(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: "Hou, Zhijie" <houzj(dot)fnst(at)cn(dot)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Ants Aasma <ants(at)cybertec(dot)at>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alastair Turner <minion(at)decodable(dot)me>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel copy
Date: 2020-10-23 08:31:09
Message-ID: 1ca1b86d-1dcd-2be1-9036-1519166434f6@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I had a brief look at at this patch. Important work! A couple of first
impressions:

1. The split between patches
0002-Framework-for-leader-worker-in-parallel-copy.patch and
0003-Allow-copy-from-command-to-process-data-from-file.patch is quite
artificial. All the stuff introduced in the first is unused until the
second patch is applied. The first patch introduces a forward
declaration for ParallelCopyData(), but the function only comes in the
second patch. The comments in the first patch talk about
LINE_LEADER_POPULATING and LINE_LEADER_POPULATED, but the enum only
comes in the second patch. I think these have to merged into one. If you
want to split it somehow, I'd suggest having a separate patch just to
move CopyStateData from copy.c to copy.h. The subsequent patch would
then be easier to read as you could see more easily what's being added
to CopyStateData. Actually I think it would be better to have a new
header file, copy_internal.h, to hold CopyStateData and the other
structs, and keep copy.h as it is.

2. This desperately needs some kind of a high-level overview of how it
works. What is a leader, what is a worker? Which process does each step
of COPY processing, like reading from the file/socket, splitting the
input into lines, handling escapes, calling input functions, and
updating the heap and indexes? What data structures are used for the
communication? How does is the work synchronized between the processes?
There are comments on those individual aspects scattered in the patch,
but if you're not already familiar with it, you don't know where to
start. There's some of that in the commit message, but it needs to be
somewhere in the source code, maybe in a long comment at the top of
copyparallel.c.

3. I'm surprised there's a separate ParallelCopyLineBoundary struct for
every input line. Doesn't that incur a lot of synchronization overhead?
I haven't done any testing, this is just my gut feeling, but I assumed
you'd work in batches of, say, 100 or 1000 lines each.

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Julien Rouhaud 2020-10-23 08:31:56 Re: Online checksums verification in the backend
Previous Message Peter Eisentraut 2020-10-23 08:14:12 Re: warn_unused_results