Re: Parallel Append implementation

From: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Append implementation
Date: 2017-03-13 08:59:47
Message-ID: CAJ3gD9fKQKk2f8VbCqwzfo5Xzej+Asbjqs+mKm2AHFUrFXdiiA@mail.gmail.com

On 10 March 2017 at 22:08, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Mar 10, 2017 at 12:17 AM, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote:
>> I agree that the two-lists approach will consume less memory than
>> bitmapset. Keeping two lists will effectively have an extra pointer
>> field which will add up to the AppendPath size, but this size will not
>> grow with the number of subpaths, whereas the Bitmapset will grow.
>
> Sure. You'll use about one BIT of memory per subpath. I'm kind of
> baffled as to why we're treating this as an issue worth serious
> discussion; the amount of memory involved is clearly very small. Even
> for an appendrel with 1000 children, that's 125 bytes of memory.
> Considering the amount of memory we're going to spend planning that
> appendrel overall, that's not significant.

Yes, I agree that we should rather let other considerations, such as
code simplicity, determine which data structure to use in AppendPath.

>
> However, Ashutosh's response made me think of something: one thing is
> that we probably do want to group all of the non-partial plans at the
> beginning of the Append so that they get workers first, and put the
> partial plans afterward. That's because the partial plans can always
> be accelerated by adding more workers as they become available, but
> the non-partial plans are just going to take as long as they take - so
> we want to start them as soon as possible. In fact, what we might
> want to do is actually sort the non-partial paths in order of
> decreasing cost, putting the most expensive one first and the others
> in decreasing order after that - and then similarly afterward with the
> partial paths. If we did that, we wouldn't need to store a bitmapset
> OR two separate lists. We could just store the index of the first
> partial plan in the list. Then you can test whether a path is partial
> by checking whether this_index >= first_partial_index.

I agree that we should preferably start the non-partial plans first.
But I am not sure it is really worth ordering the partial plans by
cost. The reason we ended up not keeping track of a per-subplan
parallel_workers count is that it would not matter much, and we would
just distribute the workers equally among all the subplans regardless
of their size. Even if smaller plans get more workers, they will
finish sooner, and those workers will then become available to the
larger subplans.

Anyway, I have given some thought to the logic of choosing the next
plan, and it works irrespective of whether the list is sorted. I have
incorporated Ashutosh's proposal of scanning the array round-robin, as
opposed to finding the plan with the minimum worker count, since the
round-robin method automatically distributes the workers evenly. The
logic uses a single array and keeps track of the first partial plan:
the first section of the array holds the non-partial plans, followed
by the partial ones. Below is the algorithm. There might be corner
cases which I haven't yet taken into account, but first I wanted to
get agreement that this looks OK to go ahead with. Since it no longer
finds the minimum worker count, it does not use pa_num_workers.
Instead it has a boolean field painfo->pa_finished.
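
To make the pseudocode below easier to follow, here is a rough sketch
of the shared state it assumes. The struct layout and the
ParallelAppendDesc name are illustrative only, not from the patch;
the field names are the ones used in the code:

/* Illustrative sketch of the shared state assumed by the code below */
typedef struct ParallelAppendInfo
{
    bool        pa_finished;    /* subplan fully executed, or non-partial
                                 * subplan already taken by some worker */
} ParallelAppendInfo;

typedef struct ParallelAppendDesc  /* struct name assumed, for illustration */
{
    int         next_plan;           /* next subplan for a worker to pick */
    int         first_plan;          /* first still-unfinished subplan */
    int         first_partial_plan;  /* index where partial subplans start */
    int         as_nplans;           /* total number of subplans */
    ParallelAppendInfo pa_info[FLEXIBLE_ARRAY_MEMBER];
} ParallelAppendDesc;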

static bool
parallel_append_next(AppendState *state)
{
    ParallelAppendDesc *padesc = state->as_padesc;  /* shared state */
    int         initial_plan;
    int         whichplan;
    bool        found = false;

    /* Make a note of which subplan we have started with */
    initial_plan = padesc->next_plan;

    /* If there is no subplan left to pick at all, we are done */
    if (initial_plan == PA_INVALID_PLAN)
        return false;

    /*
     * Keep going to the next plan until we find an unfinished one.  In the
     * process, also keep track of the first unfinished subplan.  As the
     * non-partial subplans are taken one by one, the first unfinished
     * subplan will shift ahead, so that we don't have to scan those anymore.
     */
    whichplan = initial_plan;
    for (;;)
    {
        ParallelAppendInfo *painfo = &padesc->pa_info[whichplan];

        /*
         * Ignore plans that are already done processing.  These also
         * include non-partial subplans which have already been taken by a
         * worker.
         */
        if (!painfo->pa_finished)
        {
            /*
             * If this is a non-partial plan, immediately mark it finished,
             * and shift ahead first_plan.
             */
            if (whichplan < padesc->first_partial_plan)
            {
                painfo->pa_finished = true;
                padesc->first_plan++;
            }

            found = true;
            break;
        }

        /*
         * Either go to the next index, or wrap around to the first
         * unfinished one.
         */
        whichplan = goto_next_plan(whichplan, padesc->first_plan,
                                   padesc->as_nplans - 1);

        /* Have we scanned all subplans?  If yes, we are done */
        if (whichplan == initial_plan)
            break;
    }

    /* If we didn't find any plan to execute, stop executing */
    if (!found)
        return false;

    /*
     * Set the chosen plan, and also the next plan to be picked by other
     * workers.
     */
    state->as_whichplan = whichplan;
    padesc->next_plan = goto_next_plan(whichplan, padesc->first_plan,
                                       padesc->as_nplans - 1);
    return true;
}

/* Either go to the next index, or wrap around to the first unfinished one */
static int
goto_next_plan(int curplan, int first_plan, int last_plan)
{
    if (curplan + 1 <= last_plan)
        return curplan + 1;
    else
        return first_plan;
}
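
To sanity-check the picking logic, here is a small self-contained
simulation (plain C, not executor code; the constants, globals, and
single-threaded driver are made up for illustration, and the locking
the real executor would need around the shared state is omitted).
With four subplans of which the first two are non-partial, each
non-partial plan is handed out exactly once, after which the picks
cycle round-robin over the two partial plans until they are all
marked finished:

#include <stdbool.h>
#include <stdio.h>

#define PA_INVALID_PLAN (-1)
#define NPLANS          4
#define FIRST_PARTIAL   2   /* plans 0,1 non-partial; plans 2,3 partial */

/* Simplified stand-ins for the shared structures */
static bool pa_finished[NPLANS];
static int  next_plan = 0;
static int  first_plan = 0;

static int
goto_next_plan(int curplan, int firstplan, int lastplan)
{
    return (curplan + 1 <= lastplan) ? curplan + 1 : firstplan;
}

/* Same picking logic as parallel_append_next(), minus the executor state */
static int
pick_next(void)
{
    int     initial_plan = next_plan;
    int     whichplan = initial_plan;
    bool    found = false;

    if (initial_plan == PA_INVALID_PLAN)
        return PA_INVALID_PLAN;

    for (;;)
    {
        if (!pa_finished[whichplan])
        {
            if (whichplan < FIRST_PARTIAL)
            {
                pa_finished[whichplan] = true;  /* non-partial: one taker */
                first_plan++;
            }
            found = true;
            break;
        }
        whichplan = goto_next_plan(whichplan, first_plan, NPLANS - 1);
        if (whichplan == initial_plan)
            break;
    }

    if (!found)
        return PA_INVALID_PLAN;

    next_plan = goto_next_plan(whichplan, first_plan, NPLANS - 1);
    return whichplan;
}

int
main(void)
{
    /* Six consecutive picks */
    for (int i = 0; i < 6; i++)
        printf("pick %d -> plan %d\n", i, pick_next());

    /* Once the partial plans report themselves finished, nothing is left */
    pa_finished[2] = pa_finished[3] = true;
    printf("after all finished -> %d\n", pick_next());
    return 0;
}

This prints plans 0 and 1 once each, then 2, 3, 2, 3, and finally -1
(PA_INVALID_PLAN) once everything is finished.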

>
> One problem with that is that, since the leader has about a 4ms head
> start on the other workers, it would tend to pick the most expensive
> path to run locally before any other worker had a chance to make a
> selection, and that's probably not what we want. To fix that, let's
> have the leader start at the end of the list of plans and work
> backwards towards the beginning, so that it prefers cheaper and
> partial plans over decisions that would force it to undertake a large
> amount of work itself.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
