Re: Parallel Append implementation

From: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Append implementation
Date: 2017-03-23 10:56:26
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

On 23 March 2017 at 05:55, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Mar 22, 2017 at 4:49 AM, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote:
>> Attached is the updated patch that handles the changes for all the
>> comments except the cost changes part. Details about the specific
>> changes are after the cost-related points discussed below.
>> For non-partial paths, I was checking following 3 options :
>> Option 1. Just take the sum of total non-partial child costs and
>> divide it by number of workers. It seems to be getting close to the
>> actual cost.
> If the costs for all children are about equal, then that works fine.
> But when they are very unequal, then it's highly misleading.
>> Option 2. Calculate exact cost by an algorithm which I mentioned
>> before, which is pasted below for reference :
>> Per­subpath cost : 20 16 10 8 3 1, with 3 workers.
>> After 10 time units (this is minimum of first 3 i.e. 20, 16, 10), the
>> times remaining are :
>> 10 6 0 8 3 1
>> After 6 units (minimum of 10, 06, 08), the times remaining are :
>> 4 0 0 2 3 1
>> After 2 units (minimum of 4, 2, 3), the times remaining are :
>> 2 0 0 0 1 1
>> After 1 units (minimum of 2, 1, 1), the times remaining are :
>> 1 0 0 0 0 0
>> After 1 units (minimum of 1, 0 , 0), the times remaining are :
>> 0 0 0 0 0 0
>> Now add up above time chunks : 10 + 6 + 2 + 1 + 1 = 20

> This gives the same answer as what I was proposing

Ah I see.

> but I believe it's more complicated to compute.
Yes a bit, particularly because in my algorithm, I would have to do
'n' subtractions each time, in case of 'n' workers. But it looked more
natural because it follows exactly the way we manually calculate.

> The way my proposal would work in this
> case is that we would start with an array C[3] (since there are three
> workers], with all entries 0. Logically C[i] represents the amount of
> work to be performed by worker i. We add each path in turn to the
> worker whose array entry is currently smallest; in the case of a tie,
> just pick the first such entry.
> So in your example we do this:
> C[0] += 20;
> C[1] += 16;
> C[2] += 10;
> /* C[2] is smaller than C[0] or C[1] at this point, so we add the next
> path to C[2] */
> C[2] += 8;
> /* after the previous line, C[1] is now the smallest, so add to that
> entry next */
> C[1] += 3;
> /* now we've got C[0] = 20, C[1] = 19, C[2] = 18, so add to C[2] */
> C[2] += 1;
> /* final result: C[0] = 20, C[1] = 19, C[2] = 19 */
> Now we just take the highest entry that appears in any array, which in
> this case is C[0], as the total cost.

Wow. The way your final result exactly tallies with my algorithm
result is very interesting. This looks like some maths or computer
science theory that I am not aware.

I am currently coding the algorithm using your method. Meanwhile
attached is a patch that takes care of your other comments, details of
which are below...

> In my previous review, I said that you should "define and document a
> new builtin tranche ID"; you did the first but not the second. See
> the table in monitoring.sgml.

Yeah, I tried to search how TBM did in the source, but I guess I
didn't correctly run "git grep" commands, so the results did not have
monitoring.sgml, so I thought may be you mean something else by

Added changes in monitoring.sgml now.

> Definition of exec_append_goto_next_plan should have a line break
> after the return type, per usual PostgreSQL style rules.

Oops. Done.

> - * initialize to scan first subplan
> + * In case it's a sequential Append, initialize to scan first subplan.
> This comment is confusing because the code is executed whether it's
> parallel or not. I think it might be better to write something like
> "initialize to scan first subplan (but note that we'll override this
> later in the case of a parallel append)"

> /*
> + * Check if we are already finished plans from parallel append. This
> + * can happen if all the subplans are finished when this worker
> + * has not even started returning tuples.
> + */
> + if (node->as_padesc && node->as_whichplan == PA_INVALID_PLAN)
> + return ExecClearTuple(node->ps.ps_ResultTupleSlot);
> There seems to be no reason why this couldn't be hoisted out of the
> loop. Actually, I think Ashutosh pointed this out before, but I
> didn't understand at that time what his point was. Looking back, I
> see that he also pointed out that the as_padesc test isn't necessary,
> which is also true.

I am assuming both yours and Ashutosh's concern is that this check
will be executed for *each* tuple returned, and which needs to be
avoided. Actually, just moving it out of the loop is not going to
solve the runs-for-each-tuple issue. It still will execute for each
tuple. But after a thought, now I agree this can be taken out of loop
anyways, but, not for solving the per-tuple issue, but because it need
not be run for each of the iteration of the loop because that loop is
there to go to the next subplan.

When a worker tries to choose a plan to execute at the very beginning
(i.e in ExecAppendInitializeWorker()), it sometimes finds there is no
plan to execute, because all the others have already taken them and
they are already finished or they are all non-partial plans. In short,
for all subplans, pa_finished = true. So as_whichplan has to be
PA_INVALID_PLAN. To get rid of the extra check in ExecAppend(), in
ExecAppendInitializeWorker() if all plans are finished, we can very
well assign as_whichplan to a partial plan which has already finished,
so that ExecAppend() will execute this finished subplan and just
return NULL. But if all plans are non-partial, we cannot do that.

Now, when ExecAppend() is called, there is no way to know whether this
is the first time ExecProcNode() is executed or not. So we have to
keep on checking the node->as_whichplan == PA_INVALID_PLAN condition.

My earlier response to Ashutosh's feedback on this same point are
pasted below, where there are some possible improvements discussed :

The way ExecProcNode() gets called, there is no different special code
that gets called instead of ExecProcNode() when a tuple is fetched for
the first time. I mean, we cannot prevent ExecProcNode() from getting
called when as_whichplan is invalid right from the beginning.

One thing we can do is : have a special slot in AppenState­>as_plan[]
which has some dummy execution node that just returns NULL tuple, and
initially make as_whichplan point to this slot. But I think it is not
worth doing this.

We can instead reduce the if condition to:
if (node­>as_whichplan == PA_INVALID_PLAN)
Assert(node­>as_padesc != NULL);
return ExecClearTuple(node­>ps.ps_ResultTupleSlot);
BTW, the loop which you mentioned that returns tuples.... the loop is
not for returning tuples, the loop is for iterating to the next
subplan. Even if we take the condition out and keep it in the
beginning of ExecAppend, the issue will remain.

> + if (node->as_padesc)
> + node->as_padesc->pa_finished[node->as_whichplan] = true;
> I think you should move this logic inside exec_append_parallel_next.
> That would avoid testing node->pa_desc an extra time for non-parallel
> append.

Actually exec_append_parallel_next() is called at other places also,
for which we cannot set pa_finished to true inside

But I have done the changes another way. I have taken
exec_append_parallel_next() out of exec_append_initialize_next(), and
put two different conditional code blocks in ExecAppend(), one which
calls set_finished() followed by exec_append_parallel_next() and the
other calls exec_append_initialize_next() (now renamed to

But one thing to note is that this condition is not executed for each
tuple. It is only while going to the next subplan.

> I note that the comment doesn't explain why it's safe to do
> this without taking the lock. I think we could consider doing it with
> the lock held, but it probably is safe, because we're only setting it
> from false to true. If someone else does the same thing, that won't
> hurt anything, and if someone else fails to see our update, then the
> worst-case scenario is that they'll try to execute a plan that has no
> chance of returning any more rows. That's not so bad. Actually,
> looking further, you do have a comment explaining that, but it's in
> exec_append_parallel_next() where the value is used, rather than here.
Yes, right.

> + memset(padesc->pa_finished, 0, sizeof(bool) * node->as_nplans);
> +
> + shm_toc_insert(pcxt->toc, node->ps.plan->plan_node_id, padesc);
> + node->as_padesc = padesc;
> Putting the shm_toc_insert call after we fully initialize the
> structure seems better than putting it after we've done some of the
> initialization and before we've done the rest.

Done. Also found out that I was memset()ing only pa_finished[]. Now
there is a memset for the whole ParallelAppendDesc structure.

> + /* Choose the optimal subplan to be executed. */
> I think the word "first" would be more accurate than "optimal". We
> can only hope to pick the optimal one, but whichever one we pick is
> definitely the one we're executing first!

> I think the loop in exec_append_parallel_next() is a bit confusing.
> It has three exit conditions, one checked at the top of the loop and
> two other ways to exit via break statements. Sometimes it exits
> because whichplan == PA_INVALID_PLAN was set by
> exec_append_goto_next_plan(), and other times it exits because
> whichplan == initial_plan

Yeah, we cannot bring up the (whichplan == initialplan) to the top in
for(;;) because initially whichplan is initialplan, and we have to
execute the loop at least once (unless whichplan = INVALID).
And we cannot bring down the for condition (which != PA_INVALID_PLAN)
because whichplan can be INVALID right at the beginning if
pa_next_plan itself can be PA_INVALID_PLAN.

> and then it sets whichplan == PA_INVALID_PLAN itself.
It sets that to PA_INVALID_PLAN only when it does not find any next
plan to execute. This is essential to do that particularly because
initiallly when ExecAppendInitialize[Worker/DSM]() is called, it
cannot set as_whichplan to any valid value.

> I feel like this whole function could be written more simply somehow.
Yeah, the main reason it is a bit compilcated is because we are
simulating circular array structure, and that too with an optimization
that we can skip the finished non-partial plans while wrapping around
to the next plan in the circular array. I have tried to add a couple
of more comments.

Also renamed exec_append_goto_next_plan() to
exec_append_get_next_plan() since it is not actually shifting any
counter, it is just returning what is the next counter.

Attachment Content-Type Size
ParallelAppend_v9.patch application/octet-stream 54.8 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Rafia Sabih 2017-03-23 10:59:37 Re: pgbench - allow to store select results into variables
Previous Message Ashutosh Sharma 2017-03-23 10:56:05 Re: segfault in hot standby for hash indexes