Re: Parallel Append implementation

From: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Append implementation
Date: 2017-03-08 07:00:16
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

On 19 February 2017 at 14:59, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Feb 17, 2017 at 2:56 PM, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote:
>> The log2(num_children)+1 formula which you proposed does not take into
>> account the number of workers for each of the subplans, that's why I
>> am a bit more inclined to look for some other logic. May be, treat the
>> children as if they belong to partitions, and accordingly calculate
>> the final number of workers. So for 2 children with 4 and 5 workers
>> respectively, Append parallel_workers would be : log3(3^4 + 3^5) .
> In general this will give an answer not different by more than 1 or 2
> from my answer, and often exactly the same. In the case you mention,
> whether we get the same answer depends on which way you round:
> log3(3^4+3^5) is 5 if you round down, 6 if you round up.
> My formula is more aggressive when there are many subplans that are
> not parallel or take only 1 worker, because I'll always use at least 5
> workers for an append that has 9-16 children, whereas you might use
> only 2 if you do log3(3^0+3^0+3^0+3^0+3^0+3^0+3^0+3^0+3^0). In that
> case I like my formula better. With lots of separate children, the
> chances of being able to use as many as 5 workers seem good. (Note
> that using 9 workers as Ashutosh seems to be proposing would be a
> waste if the different children have very unequal execution times,
> because the workers that run children with short execution times can
> be reused to run additional subplans while the long ones are still
> running. Running a separate worker for each child only works out if
> the shortest runtime is more than 50% of the longest runtime, which
> may sometimes be true but doesn't seem like a good bet in general.)
> Your formula is more aggressive when you have 3 children that all use
> the same number of workers; it'll always decide on <number of workers
> per child>+1, whereas mine won't add the extra worker in that case.
> Possibly your formula is better than mine in that case, but I'm not
> sure. If you have as many as 9 children that all want N workers, your
> formula will decide on N+2 workers, but since my formula guarantees a
> minimum of 5 workers in such cases, I'll probably be within 1 of
> whatever answer you were getting.

Yeah, that seems to be right in most of the cases. The only cases
where your formula seems to give too few workers is for something like
: (2, 8, 8). For such subplans, we should at least allocate 8 workers.
It turns out that in most of the cases in my formula, the Append
workers allocated is just 1 worker more than the max per-subplan
worker count. So in (2, 1, 1, 8), it will be a fraction more than 8.
So in the patch, in addition to the log2() formula you proposed, I
have made sure that it allocates at least equal to max(per-subplan
parallel_workers values).

>> BTW, there is going to be some logic change in the choose-next-subplan
>> algorithm if we consider giving extra workers to subplans.
> I'm not sure that it's going to be useful to make this logic very
> complicated. I think the most important thing is to give 1 worker to
> each plan before we give a second worker to any plan. In general I
> think it's sufficient to assign a worker that becomes available to the
> subplan with the fewest number of workers (or one of them, if there's
> a tie) without worrying too much about the target number of workers
> for that subplan.

In the attached v5 patch, the logic of distributing the workers is now
kept simple : it just distributes the workers equally without
considering the per-sublan parallel_workers value. I have retained the
earlier logic of choosing the plan with minimum current workers. But
now that the pa_max_workers is not needed, I removed it, and instead a
partial_plans bitmapset is added in the Append node. Once a worker
picks up a non-partial subplan, it immediately changes its
pa_num_workers to -1. Whereas for partial subplans, the worker sets it
to -1 only after it finishes executing it.

Effectively, in parallel_append_next(), the check for whether subplan
is executing with max parallel_workers is now removed, and all code
that was using pa_max_workers is now removed.

Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> 10. We should probably move the parallel_safe calculation out of cost_append().
> + path->parallel_safe = path->parallel_safe &&
> + subpath->parallel_safe;
> 11. This check shouldn't be part of cost_append().
> + /* All child paths must have same parameterization */
> + Assert(bms_equal(PATH_REQ_OUTER(subpath), required_outer));

Moved out these two statements from cost_append(). Did it separately
in create_append_path().

Also, I have removed some elog() statements which were there while
inside Spinlock in parallel_append_next().

On 17 January 2017 at 11:10, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> I was looking at the executor portion of this patch and I noticed that in
> exec_append_initialize_next():
> if (appendstate->as_padesc)
> return parallel_append_next(appendstate);
> /*
> * Not parallel-aware. Fine, just go on to the next subplan in the
> * appropriate direction.
> */
> if (ScanDirectionIsForward(appendstate->ps.state->es_direction))
> appendstate->as_whichplan++;
> else
> appendstate->as_whichplan--;
> which seems to mean that executing Append in parallel mode disregards the
> scan direction. I am not immediately sure what implications that has, so
> I checked what heap scan does when executing in parallel mode, and found
> this in heapgettup():
> else if (backward)
> {
> /* backward parallel scan not supported */
> Assert(scan->rs_parallel == NULL);
> Perhaps, AppendState.as_padesc would not have been set if scan direction
> is backward, because parallel mode would be disabled for the whole query
> in that case (PlannerGlobal.parallelModeOK = false). Maybe add an
> Assert() similar to one in heapgettup().

Right. Thanks for noticing this. I have added a similar Assert in

Attachment Content-Type Size
ParallelAppend_v5.patch application/octet-stream 34.1 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2017-03-08 07:28:38 Re: parallel index(-only) scan breaks when run without parallelism
Previous Message Tsunakawa, Takayuki 2017-03-08 06:52:26 [bug fix] dblink leaks unnamed connections