Re: Parallel Append implementation

From: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Append implementation
Date: 2017-03-08 07:00:16
Message-ID: CAJ3gD9cyn+yCLWW56S+v5s5eVEwWZVrdPT25a+Bv9rDYPz8o1Q@mail.gmail.com
Lists: pgsql-hackers

On 19 February 2017 at 14:59, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Feb 17, 2017 at 2:56 PM, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote:
>> The log2(num_children)+1 formula which you proposed does not take into
>> account the number of workers for each of the subplans, that's why I
>> am a bit more inclined to look for some other logic. May be, treat the
>> children as if they belong to partitions, and accordingly calculate
>> the final number of workers. So for 2 children with 4 and 5 workers
>> respectively, Append parallel_workers would be : log3(3^4 + 3^5) .
>
> In general this will give an answer not different by more than 1 or 2
> from my answer, and often exactly the same. In the case you mention,
> whether we get the same answer depends on which way you round:
> log3(3^4+3^5) is 5 if you round down, 6 if you round up.
>
> My formula is more aggressive when there are many subplans that are
> not parallel or take only 1 worker, because I'll always use at least 5
> workers for an append that has 9-16 children, whereas you might use
> only 2 if you do log3(3^0+3^0+3^0+3^0+3^0+3^0+3^0+3^0+3^0). In that
> case I like my formula better. With lots of separate children, the
> chances of being able to use as many as 5 workers seem good. (Note
> that using 9 workers as Ashutosh seems to be proposing would be a
> waste if the different children have very unequal execution times,
> because the workers that run children with short execution times can
> be reused to run additional subplans while the long ones are still
> running. Running a separate worker for each child only works out if
> the shortest runtime is more than 50% of the longest runtime, which
> may sometimes be true but doesn't seem like a good bet in general.)
>
> Your formula is more aggressive when you have 3 children that all use
> the same number of workers; it'll always decide on <number of workers
> per child>+1, whereas mine won't add the extra worker in that case.
> Possibly your formula is better than mine in that case, but I'm not
> sure. If you have as many as 9 children that all want N workers, your
> formula will decide on N+2 workers, but since my formula guarantees a
> minimum of 5 workers in such cases, I'll probably be within 1 of
> whatever answer you were getting.
>
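(For reference, 3^4 + 3^5 = 81 + 243 = 324, and log3(324) ~= 5.26, which
is where the round-down-to-5 versus round-up-to-6 difference above comes
from.)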

Yeah, that seems to be right in most of the cases. The only cases where
your formula seems to give too few workers are ones like (2, 8, 8): for
such subplans, we should allocate at least 8 workers. It turns out that
in most cases, my formula allocates just one Append worker more than the
max per-subplan worker count; so for (2, 1, 1, 8), it would be a
fraction more than 8. So in the patch, in addition to the log2() formula
you proposed, I have made sure that it allocates at least
max(per-subplan parallel_workers values) workers.
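
In other words, the computation is roughly this (a sketch with
illustrative variable names, not the exact patch code):

    /* Sketch only: num_children and max_per_subplan_workers are
     * illustrative names, not the patch's actual variables. */
    parallel_workers = (int) ceil(log2(num_children)) + 1;

    /* Never go below the largest per-subplan parallel_workers value */
    parallel_workers = Max(parallel_workers, max_per_subplan_workers);

    /* And respect the usual per-gather limit */
    parallel_workers = Min(parallel_workers,
                           max_parallel_workers_per_gather);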

>
>> BTW, there is going to be some logic change in the choose-next-subplan
>> algorithm if we consider giving extra workers to subplans.
>
> I'm not sure that it's going to be useful to make this logic very
> complicated. I think the most important thing is to give 1 worker to
> each plan before we give a second worker to any plan. In general I
> think it's sufficient to assign a worker that becomes available to the
> subplan with the fewest number of workers (or one of them, if there's
> a tie) without worrying too much about the target number of workers
> for that subplan.

In the attached v5 patch, the logic for distributing the workers is kept
simple: it just distributes the workers equally, without considering the
per-subplan parallel_workers value. I have retained the earlier logic of
choosing the subplan with the minimum number of current workers. But now
that pa_max_workers is no longer needed, I removed it, and instead added
a partial_plans bitmapset to the Append node. Once a worker picks up a
non-partial subplan, it immediately changes that subplan's
pa_num_workers to -1, whereas for a partial subplan, a worker sets it to
-1 only after it finishes executing the subplan.

Effectively, parallel_append_next() no longer checks whether a subplan
is already executing with its max parallel_workers, and all the code
that used pa_max_workers is gone.
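
So the choose-next-subplan logic is conceptually just this (a simplified
sketch with approximate field names; see the patch for the real code):

    /* Simplified sketch of parallel_append_next(); field names are
     * approximate.  Runs while holding the spinlock. */
    min_plan = -1;
    for (i = 0; i < node->as_nplans; i++)
    {
        /* -1 means finished, or a non-partial plan already claimed */
        if (padesc->pa_info[i].pa_num_workers < 0)
            continue;
        if (min_plan < 0 ||
            padesc->pa_info[i].pa_num_workers <
            padesc->pa_info[min_plan].pa_num_workers)
            min_plan = i;
    }
    if (min_plan >= 0)
    {
        if (bms_is_member(min_plan, partial_plans))
            padesc->pa_info[min_plan].pa_num_workers++;    /* joining */
        else
            padesc->pa_info[min_plan].pa_num_workers = -1; /* claimed */
    }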

Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> 10. We should probably move the parallel_safe calculation out of cost_append().
> + path->parallel_safe = path->parallel_safe &&
> + subpath->parallel_safe;
>
> 11. This check shouldn't be part of cost_append().
> + /* All child paths must have same parameterization */
> + Assert(bms_equal(PATH_REQ_OUTER(subpath), required_outer));
>

Moved these two statements out of cost_append(); they are now done
separately in create_append_path().
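
For reference, the two statements now sit in the subpath loop of
create_append_path(), roughly like this (approximate placement; the
patch has the exact form):

    foreach(l, subpaths)
    {
        Path   *subpath = (Path *) lfirst(l);

        pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
            subpath->parallel_safe;

        /* All child paths must have same parameterization */
        Assert(bms_equal(PATH_REQ_OUTER(subpath), required_outer));
    }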

Also, I have removed some elog() statements that were called while
holding the spinlock in parallel_append_next(), since elog() is not safe
to call inside a spinlock section.

On 17 January 2017 at 11:10, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> I was looking at the executor portion of this patch and I noticed that in
> exec_append_initialize_next():
>
> if (appendstate->as_padesc)
> return parallel_append_next(appendstate);
>
> /*
> * Not parallel-aware. Fine, just go on to the next subplan in the
> * appropriate direction.
> */
> if (ScanDirectionIsForward(appendstate->ps.state->es_direction))
> appendstate->as_whichplan++;
> else
> appendstate->as_whichplan--;
>
> which seems to mean that executing Append in parallel mode disregards the
> scan direction. I am not immediately sure what implications that has, so
> I checked what heap scan does when executing in parallel mode, and found
> this in heapgettup():
>
> else if (backward)
> {
> /* backward parallel scan not supported */
> Assert(scan->rs_parallel == NULL);
>
> Perhaps, AppendState.as_padesc would not have been set if scan direction
> is backward, because parallel mode would be disabled for the whole query
> in that case (PlannerGlobal.parallelModeOK = false). Maybe add an
> Assert() similar to one in heapgettup().
>

Right. Thanks for noticing this. I have added a similar Assert in
exec_append_initialize_next().
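
The added check is along the lines of the heapgettup() assertion quoted
above (approximate; the patch has the exact form):

    if (appendstate->as_padesc)
    {
        /* Parallel mode is never used for backward scans */
        Assert(ScanDirectionIsForward(appendstate->ps.state->es_direction));
        return parallel_append_next(appendstate);
    }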

Attachment Content-Type Size
ParallelAppend_v5.patch application/octet-stream 34.1 KB
