Re: Parallel Append implementation

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Append implementation
Date: 2017-03-16 12:48:38
Message-ID: CAFjFpRc3V47D7Wek72GXHYo6AyA0P-5t9ucNdLxbNyHLvAY+BA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 16, 2017 at 3:57 PM, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote:
> On 12 March 2017 at 08:50, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> However, Ashutosh's response made me think of something: one thing is
>>> that we probably do want to group all of the non-partial plans at the
>>> beginning of the Append so that they get workers first, and put the
>>> partial plans afterward. That's because the partial plans can always
>>> be accelerated by adding more workers as they become available, but
>>> the non-partial plans are just going to take as long as they take - so
>>> we want to start them as soon as possible. In fact, what we might
>>> want to do is actually sort the non-partial paths in order of
>>> decreasing cost, putting the most expensive one first and the others
>>> in decreasing order after that - and then similarly afterward with the
>>> partial paths. If we did that, we wouldn't need to store a bitmapset
>>> or two separate lists. We could just store the index of the first
>>> partial plan in the list. Then you can test whether a path is partial
>>> by checking whether this_index >= first_partial_index.
>
> Attached is an updated patch v7, which does the above. Now,
> AppendState->subplans has all non-partial subplans followed by all
> partial subplans, with the non-partial subplans in the order of
> descending total cost. Also, for convenience, the AppendPath now has
> similar ordering in its AppendPath->subpaths. So there is a new field
> in both Append and AppendPath: first_partial_path/plan, which has the
> value 0 if there are no non-partial subpaths.
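
With that ordering, checking whether a given subplan is partial reduces to a
single index comparison. Just to spell out the invariant as I understand it
(a rough sketch only; I am assuming the Append field is spelled
first_partial_plan, per the description above):

    /* Sketch only: subplans at or after first_partial_plan are partial. */
    static inline bool
    is_partial_subplan(Append *node, int whichplan)
    {
        return whichplan >= node->first_partial_plan;
    }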
>
> Also, the backend now scans in reverse, so that it does not pick up the
> most expensive path.
>
> There are also some changes in the costing. Now that we know that the
> very first path is the costliest non-partial path, we can use its total
> cost as the total cost of the Append in case all the partial path costs
> are smaller.
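
If I am reading the costing change right, it amounts to taking the costliest
non-partial path's total cost, unless some partial path's total cost is
higher. A sketch of that logic (the variable names here are illustrative,
not the patch's actual ones):

    /* Illustrative only: append cost = max(costliest non-partial, partial costs). */
    Cost        append_total = costliest_nonpartial_cost;   /* 0 if none */
    ListCell   *lc;

    foreach(lc, partial_subpaths)
    {
        Path   *subpath = (Path *) lfirst(lc);

        append_total = Max(append_total, subpath->total_cost);
    }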
>
> Modified/enhanced an existing test scenario in
> src/test/regress/select_parallel.sql so that Parallel Append is
> covered.
>
> As suggested by Robert, since pa_info->pa_finished was the only field
> in pa_info, removed the ParallelAppendDescData.pa_info structure, and
> instead brought pa_info->pa_finished into ParallelAppendDescData.
>
>>>> +static inline void
>>>> +exec_append_scan_first(AppendState *appendstate)
>>>> +{
>>>> + appendstate->as_whichplan = 0;
>>>> +}
>>>>
>>>> I don't think this is buying you anything, and suggest backing it out.
>>>
>>> This is required for sequential Append, so that we can start executing
>>> from the first subplan.
>>
>> My point is that there's really no point in defining a static inline
>> function containing one line of code. You could just put that line of
>> code in whatever places need it, which would probably be more clear.
>
> Did the same.

Some comments
+ * Check if we are already finished plans from parallel append. This
+ * can happen if all the subplans are finished when this worker
+ * has not even started returning tuples.
+ */
+ if (node->as_padesc && node->as_whichplan == PA_INVALID_PLAN)
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
From the comment, it looks like this condition will be encountered before the
backend returns any tuple. But this code is part of the loop which returns the
tuples. Shouldn't it be outside the loop? Why check the condition for every row
returned when it can arise only once, and that too before any tuple is
returned?
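
Something with the following shape is what I had in mind, i.e. the check done
once, before the tuple-returning loop (just an illustration, not tested):

    /* Done once, before the loop that fetches and returns tuples. */
    if (node->as_padesc && node->as_whichplan == PA_INVALID_PLAN)
        return ExecClearTuple(node->ps.ps_ResultTupleSlot);

    for (;;)
    {
        /* fetch a tuple from the current subplan, advance subplans, etc. */
    }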

Why do we need the following code in both ExecAppendInitializeWorker() and
ExecAppendInitializeDSM()? Both of those run before the actual execution
starts, so one of them should suffice?
+ /* Choose the optimal subplan to be executed. */
+ (void) parallel_append_next(node);

There is no pa_num_workers field now, so this comment probably needs updating.
Per the comment, should we also get rid of SpinLockAcquire() and
SpinLockRelease()?
+ * purpose. The spinlock is used so that it does not change the
+ * pa_num_workers field while workers are choosing the next node.

BTW, sa_finished seems to be a misnomer. The plan is not finished yet, but it
wants no more workers. So, should it be renamed to sa_no_new_workers or
something like that?

In parallel_append_next() we shouldn't need to call goto_next_plan() twice. If
the plan indicated by pa_next_plan is finished, all the plans must have
finished. This should be true if we set pa_next_plan to 0 at the time of
initialization. Any worker picking up pa_next_plan will set it to the next
valid plan. So the next worker asking for a plan should pick pa_next_plan and
set it to the next one and so on.
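
In other words, something with this shape ought to be enough (a sketch of the
control flow only, locking elided; I am guessing at the data layout and at
goto_next_plan()'s signature here):

    int         whichplan = padesc->pa_next_plan;

    if (padesc->pa_finished[whichplan])
        node->as_whichplan = PA_INVALID_PLAN;   /* nothing left to hand out */
    else
    {
        node->as_whichplan = whichplan;
        /* single call: advance pa_next_plan to the next unfinished plan */
        padesc->pa_next_plan = goto_next_plan(node, whichplan);
    }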

I am wondering whether goto_next_plan() can be simplified with some modular
arithmetic, e.g. ((whichplan - first_plan) + 1) % (last_plan - first_plan)
+ first_plan.
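
Assuming first_plan and last_plan are both inclusive indexes, the wrap-around
could look roughly like this (the "+ 1" on the range size is my guess about
how the bounds are meant; drop it if last_plan is exclusive):

    /* Advance whichplan, wrapping within [first_plan, last_plan]. */
    static int
    advance_plan(int whichplan, int first_plan, int last_plan)
    {
        int     range = last_plan - first_plan + 1;

        return first_plan + (whichplan - first_plan + 1) % range;
    }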

I am still reviewing the patch.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
