Re: Parallel Append implementation

From: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Append implementation
Date: 2017-10-19 07:05:05
Message-ID: CAJ3gD9ctEcrVUmpY6fq_JUB6WDKGXAGd70EY68jVFA4kxMbKeQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

On 13 October 2017 at 00:29, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Oct 11, 2017 at 8:51 AM, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote:
>> [ new patch ]
>
> + <entry><literal>parallel_append</></entry>
> + <entry>Waiting to choose the next subplan during Parallel Append plan
> + execution.</entry>
> + </row>
> + <row>
>
> Probably needs to update a morerows values of some earlier entry.

From what I observed from the other places, the morerows value is one
less than the number of following entries. I have changed it to 10
since it has 11 entries.

>
> + <primary><varname>enable_parallelappend</> configuration
> parameter</primary>
>
> How about enable_parallel_append?

Done.

>
> + * pa_finished : workers currently executing the subplan. A worker which
>
> The way the colon is used here is not a standard comment style for PostgreSQL.

Changed it to "pa_finished:".

>
> + * Go on to the "next" subplan. If no more subplans, return the empty
> + * slot set up for us by ExecInitAppend.
> + * Note: Parallel-aware Append follows different logic for choosing the
> + * next subplan.
>
> Formatting looks wrong, and moreover I don't think this is the right
> way of handling this comment anyway. Move the existing comment inside
> the if (!node->padesc) block and leave it unchanged; the else block
> explains the differences for parallel append.

I think the first couple of lines do apply to both parallel-append and
sequential append plans. I have moved the remaining couple of lines
inside the else block.

>
> + * ExecAppendEstimate
> + *
> + * estimates the space required to serialize Append node.
>
> Ugh, this is wrong, but I notice it follows various other
> equally-wrong comments for other parallel-aware node types. I guess
> I'll go fix that. We are not in serializing the Append node.

I didn't clealy get this. Do you think it should be "space required to
copy the Append node into the shared memory" ?

>
> I do not think that it's a good idea to call
> exec_append_parallel_next() from ExecAppendInitializeDSM,
> ExecAppendReInitializeDSM, and ExecAppendInitializeWorker. We want to
> postpone selecting which plan to run until we're actually ready to run
> that plan. Otherwise, for example, the leader might seize a
> non-partial plan (if only such plans are included in the Parallel
> Append) when it isn't really necessary for it to do so. If the
> workers would've reached the plans and started returning tuples to the
> leader before it grabbed a plan, oh well, too bad. The leader's still
> claimed that plan and must now run it.
>
> I concede that's not a high-probability scenario, but I still maintain
> that it is better for processes not to claim a subplan until the last
> possible moment. I think we need to initialize as_whichplan as
> PA_INVALID plan and then fix it when ExecProcNode() is called for the
> first time.

Done. Set as_whichplan to PA_INVALID_PLAN in
ExecAppendInitializeDSM(), ExecAppendReInitializeDSM() and
ExecAppendInitializeWorker(). Then when ExecAppend() is called for the
first time, we notice that as_whichplan is PA_INVALID_PLAN, that means
we need to choose the plan.

>
> + if (!IsParallelWorker())
>
> This is not a great test, because it would do the wrong thing if we
> ever allowed an SQL function called from a parallel worker to run a
> parallel query of its own. Currently that's not allowed but we might
> want to allow it someday. What we really want to test is whether
> we're the leader for *this* query. Maybe use a flag in the
> AppendState for that, and set it correctly in
> ExecAppendInitializeWorker.

Done. Set a new AppendState->is_parallel_worker field to true in
ExecAppendInitializeWorker().

>
> I think maybe the loop in exec_append_parallel_next should look more like this:
>
> /* Pick the next plan. */
> state->as_whichplan = padesc->pa_nextplan;
> if (state->as_whichplan != PA_INVALID_PLAN)
> {
> int nextplan = state->as_whichplan;
>
> /* Mark non-partial plans done immediately so that they can't be
> picked again. */
> if (nextplan < first_partial_plan)
> padesc->pa_finished[nextplan] = true;
>
> /* Figure out what plan the next worker should pick. */
> do
> {
> /* If we've run through all the plans, loop back through
> partial plans only. */
> if (++nextplan >= state->as_nplans)
> nextplan = first_partial_plan;
>
> /* No plans remaining or tried them all? Then give up. */
> if (nextplan == state->as_whichplan || nextplan >= state->as_nplans)
> {
> nextplan = PA_INVALID_PLAN;
> break;
> }
> } while (padesc->pa_finished[nextplan]);
>
> /* Store calculated next plan back into shared memory. */
> padesc->pa_next_plan = nextplan;
> }
>
> This might not be exactly right and the comments may need work, but
> here are a couple of points:
>
> - As you have it coded, the loop exit condition is whichplan !=
> PA_INVALID_PLAN, but that's probably an uncommon case and you have two
> other ways out of the loop. It's easier to understand the code if the
> loop condition corresponds to the most common way of exiting the loop,
> and any break is for some corner case.
>
> - Don't need a separate call to exec_append_get_next_plan; it's all
> handled here (and, I think, pretty compactly).

Got rid of exec_append_get_next_plan() and having to do that logic twice.

>
> - No need for pa_first_plan any more. Looping back to
> first_partial_plan is a fine substitute, because by the time anybody
> loops around, pa_first_plan would equal first_partial_plan anyway
> (unless there's a bug).

Yeah, I agree. Got rid of pa_first_plan.

>
> - In your code, the value in shared memory is the point at which to
> start the search for the next plan. Here, I made it the value that
> the next worker should adopt without question.

I was considering this option, but found out that we *have* to return
from exec_append_parallel_next() with this next worker chosen. Now if
the leader happens to reach this plan and finish it, and then for the
workers the padesc->pa_next_plan happens to point to this same plan,
we need to return some other plan.

> Another option would
> be to make it the value that the last worker adopted.

Here, we need to think of an initial value of pa_next_plan when
workers haven't yet started. It can be PA_INVALID_PLAN, but I felt
this does not convey clearly whether none of the plans have started
yet, or all plans have ended.

> I think that
> either that option or what I did above are slightly better than what
> you have, because as you have it, you've got to use the
> increment-with-looping logic in two different places whereas either of
> those options only need it in one place, which makes this a little
> simpler.

The way I have now used the logic more or less looks like the code you
showed above. The differences are :

The padesc->pa_next_plan still points to the plan from which to search
for an unfinished plan. But what's changed is : I keep track of
whichplan and also a nextplan position while searching for the plan.
So even if we find an unfinished plan, there will be a nextplan
pointing appropriately. If the whichplan is a finished one, in the
next iteration nextplan value is assigned to whichplan. This way I
avoided having to separately call the wrap-around logic again outside
of the search loop.

Another additional logic added is : While searching, if whichplan
still points to a non-partial plan, and the backend has already
finished the partial plans and the remaining non-partial plan, then
this condition is not enough to break out of the loop :
> if (++nextplan >= state->as_nplans)
> nextplan = first_partial_plan;
> /* No plans remaining or tried them all? Then give up. */
> if (nextplan == state->as_whichplan || nextplan >= state->as_nplans)
> {
> nextplan = PA_INVALID_PLAN;
> break;
> }
This is because, the initial plan with which we started is a
non-partial plan. So above, nextplan never becomes state->as_whichplan
because state->as_whichplan would always be lesser than
first_partial_plan.

So I have split the break condition into two conditions, one of which
is for wrap-around case :

if (whichplan + 1 == state->as_nplans)
{
nextplan = first_partial_plan;
/*
* If we had started from a non-partial plan, that means we have
* searched all the nonpartial and partial plans.
*/
if (initial_plan <= first_partial_plan)
break;
}
else
{
nextplan = whichplan + 1;

/* Have we made a full circle ? */
if (nextplan == initial_plan)
break;
}

Also, we need to consider the possibility that the next plan to be
chosen can even be the same plan that we have started with. This
happens when there is only one unfinished partial plan remaining. So
we should not unconditionally do "nextplan = PA_INVALID_PLAN" if
(nextplan == state->as_whichplan). The changes in the patch considers
this (this was considered also in the previous versions).

Where we set node->as_padesc->pa_finished to true for a partial plan,
I have wrapped it with LWLock lock and release calls. This is
especially because now we use this field while deciding whether the
nextplan is to be set to PA_INVALID_PLAN. I guess this might not be
required for correctness, but it looks less safe with pa_finished
value getting changed while we make decisions depending on it.

Attached v18 patch.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment Content-Type Size
ParallelAppend_v18.patch application/octet-stream 60.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2017-10-19 08:53:28 Re: show "aggressive" or not in autovacuum logs
Previous Message Amit Langote 2017-10-19 06:46:49 Re: path toward faster partition pruning