Re: Parallel Append implementation

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Append implementation
Date: 2017-01-16 06:08:33
Message-ID: CAFjFpRdfTCQqbVftPL=jwt13Pm3KCBmY2HgjAu2iWngGGe2ONg@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jan 16, 2017 at 9:49 AM, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote:
> Thanks Ashutosh for the feedback.
>
> On 6 January 2017 at 17:04, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> On Fri, Dec 23, 2016 at 10:51 AM, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote:
>>> Currently an Append plan node does not execute its subplans in
>>> parallel. There is no distribution of workers across its subplans. The
>>> second subplan starts running only after the first subplan finishes,
>>> although the individual subplans may be running parallel scans.
>>>
>>> Secondly, we create a partial Append path for an appendrel, but we do
>>> that only if all of its member subpaths are partial paths. If one or
>>> more of the subplans is a non-parallel path, there will be only a
>>> non-parallel Append. So whatever node is sitting on top of Append is
>>> not going to do a parallel plan; for example, a select count(*) won't
>>> divide it into partial aggregates if the underlying Append is not
>>> partial.
>>>
>>> The attached patch removes both of the above restrictions. There has
>>> already been a mail thread [1] that discusses an approach suggested by
>>> Robert Haas for implementing this feature. This patch uses this same
>>> approach.
>>
>> The first goal requires some kind of synchronization which will allow workers
>> to be distributed across the subplans. The second goal requires some kind of
>> synchronization to prevent multiple workers from executing non-parallel
>> subplans. The patch uses different mechanisms to achieve the goals. If
>> we create two different patches addressing each goal, those may be
>> easier to handle.
>
> Goal A : Allow non-partial subpaths in Partial Append.
> Goal B : Distribute workers across the Append subplans.
> Both of these require some kind of synchronization while choosing the
> next subplan. So, goal B is achieved by doing all the synchronization
> stuff, and the implementation of goal A requires that goal B is
> implemented. So there is a dependency between these two goals. While
> implementing goal B, we should keep in mind that it should also work
> for goal A; it does not make sense to change the synchronization logic
> later for goal A.
>
> I am ok with splitting the patch into two patches:
> a) changes required for goal A
> b) changes required for goal B.
> But I think we should split it only when we are ready to commit them
> (commit for B, immediately followed by commit for A). Until then, we
> should consider both of these together, because they are interconnected
> as explained above.

For B, we need to know how much gain it brings and in which cases. If
that gain is not worth the added complexity, we may have to defer goal
B. Goal A would certainly be useful, since it will improve the
performance of the targeted cases. The synchronization required for
goal A is simpler than that for B, so if we choose to implement only
A, we can live with the simpler synchronization.
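
To illustrate what I mean by the simpler synchronization: if the
non-partial subplans are kept at the front of the subplan list, goal A
only needs each of them to be handed out exactly once, which a shared
counter under a spinlock can do. A rough sketch (all names here are
mine for illustration, not the patch's):

    /* Shared state in the DSM segment. */
    typedef struct ParallelAppendState
    {
        slock_t     pa_mutex;       /* protects pa_next_plan */
        int         pa_next_plan;   /* next unclaimed subplan */
    } ParallelAppendState;

    /*
     * Hand out the next subplan.  Subplans below first_partial_plan are
     * non-partial and are claimed exactly once; after they are exhausted,
     * every worker falls through to the partial subplans, which any
     * number of workers may execute.
     */
    static int
    parallel_append_next(ParallelAppendState *pstate, int first_partial_plan)
    {
        int         whichplan;

        SpinLockAcquire(&pstate->pa_mutex);
        whichplan = pstate->pa_next_plan;
        if (whichplan < first_partial_plan)
            pstate->pa_next_plan++;     /* claimed exclusively */
        SpinLockRelease(&pstate->pa_mutex);

        return whichplan;
    }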

BTW, right now the patch does not consider non-partial paths for a
child that has partial paths. Do we know, for sure, that an Append
path using a child's partial path, wherever one exists, is always
going to be cheaper than one which uses the non-partial path? If not,
should we also build another path which uses non-partial paths for all
child relations? Choosing the best combination sounds like a 0/1
knapsack problem.

>
>
>> Here are some review comments:
> I will handle the other comments, but first, just a quick response to
> some important ones:
>
>> 6. By looking at the parallel_workers field of a path, we can say whether it's
>> partial or not. We probably do not need to maintain a bitmap for that in
>> the Append path. The bitmap can be constructed, if required, at the time of
>> creating the partial append plan. The reason to take this small step is: 1. we
>> want to minimize our work at the time of creating paths; 2. while freeing a
>> path in add_path, we don't free the internal structures, in this case the
>> Bitmapset, which will waste memory if the path is not chosen while planning.
>
> Let me try keeping the per-subplan max_worker info in the Append path
> itself, like I mentioned above. If that works, the bitmap will be
> replaced by a max_worker field. In case of a non-partial subpath,
> max_worker will be 1. (This is the same info kept in the AppendState
> node in the patch, but now we might need to keep it in the Append path
> node as well.)

It will be better if we can fetch that information from each subpath
when creating the plan. As I have explained before, a path is a
minimal structure, which should be easily disposable when the path is
thrown away.
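
Something like this in create_append_plan() should suffice
(illustrative sketch; a subpath with parallel_workers > 0 is a partial
path):

    Bitmapset  *partial_subplans = NULL;
    int         i = 0;
    ListCell   *lc;

    foreach(lc, best_path->subpaths)
    {
        Path       *subpath = (Path *) lfirst(lc);

        /* Mark the subplans whose chosen subpath is partial. */
        if (subpath->parallel_workers > 0)
            partial_subplans = bms_add_member(partial_subplans, i);
        i++;
    }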

>
>>
>> 7. If we consider 6, we don't need concat_append_subpaths(), but still here are
>> some comments about that function. Instead of accepting two separate arguments,
>> childpaths and child_partial_subpaths_set, which need to be kept in sync, we
>> can just pass the path which contains both of those. Similarly, the following
>> code may be optimized by adding a utility function to Bitmapset which advances
>> all members by a given offset, and using that function with bms_union() to
>> merge the bitmapsets, e.g.
>>
>> bms_union(*partial_subpaths_set,
>>           bms_advance_members(bms_copy(child_partial_subpaths_set),
>>                               append_subpath_len));
>>
>> if (partial_subpaths_set)
>> {
>>     for (i = 0; i < list_length(childpaths); i++)
>>     {
>>         /*
>>          * The child paths themselves may or may not be partial paths. The
>>          * bitmapset numbers of these paths will need to be set considering
>>          * that these are to be appended onto the partial_subpaths_set.
>>          */
>>         if (!child_partial_subpaths_set ||
>>             bms_is_member(i, child_partial_subpaths_set))
>>         {
>>             *partial_subpaths_set = bms_add_member(*partial_subpaths_set,
>>                                                    append_subpath_len + i);
>>         }
>>     }
>> }
>
> Again, for the reason mentioned above, we will defer this point for now.

Ok.
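
For whenever this gets picked up again, a minimal sketch of the helper
I had in mind (the copy-based implementation is only for illustration;
a real version might shift the bitmap words in place instead):

    /* Return a new set whose members are those of 'a', shifted up by 'offset'. */
    static Bitmapset *
    bms_advance_members(const Bitmapset *a, int offset)
    {
        Bitmapset  *result = NULL;
        int         x = -1;

        while ((x = bms_next_member(a, x)) >= 0)
            result = bms_add_member(result, x + offset);

        return result;
    }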

>
>>
>> 8.
>> -    parallel_workers = Max(parallel_workers, path->parallel_workers);
>> +    /*
>> +     * partial_subpaths can have non-partial subpaths, so
>> +     * path->parallel_workers can be 0. For such paths, allocate one
>> +     * worker.
>> +     */
>> +    parallel_workers +=
>> +        (path->parallel_workers > 0 ? path->parallel_workers : 1);
>>
>> This looks odd. The earlier code was choosing the maximum of all
>> parallel workers, whereas the new code adds them all up. E.g. if
>> parallel_workers for the subpaths is 3, 4, 3, without your change it
>> will pick 4, but with your change it will pick 10. I think you intend
>> to write this as
>> parallel_workers = Max(parallel_workers, path->parallel_workers ?
>> path->parallel_workers : 1);
> The intention is to add all the workers, because a parallel-aware Append
> is going to need them in order to make the subplans run at their
> full capacity in parallel. So with subpaths having 3, 4, and 3 workers,
> the Append path will need 10 workers. If it allocates 4 workers, that's
> not sufficient: each subplan would get only 1 worker, or at most 2. In
> the existing code, 4 is correct, because all the workers are going to
> execute the same subplan node at a time.
>

Ok, makes sense if we take up Goal B.
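
In that case, the accumulation would presumably look something like
this (illustrative only; the result should still be clamped by
max_parallel_workers_per_gather):

    parallel_workers = 0;
    foreach(l, subpaths)
    {
        Path       *path = (Path *) lfirst(l);

        /* A non-partial subpath still occupies one worker of its own. */
        parallel_workers += Max(path->parallel_workers, 1);
    }
    parallel_workers = Min(parallel_workers, max_parallel_workers_per_gather);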

>>
>> 9. In get_parallel_divisor(), if parallel_workers is 0, i.e. it's a partial
>> path, the return value will be 2, which isn't true. This function is being
>> called for all the subpaths to get the original number of rows and costs of
>> partial paths. I think we either don't need to call this function on subpaths
>> which are not partial paths, or we should make it work for parallel_workers = 0.
> I didn't understand this. I checked the get_parallel_divisor()
> code again. I think it will return 1 if parallel_workers is 0, but
> I may be missing something.

Sorry, I also don't understand why I had that comment. For some
reason, I thought we were passing 1 to get_parallel_divisor() when
parallel_workers = 0, but I don't understand why I thought so.
Anyway, I will provide a better explanation the next time I run into
this.
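
(For reference, paraphrasing get_parallel_divisor() from costsize.c:
with parallel_workers = 0 the divisor starts at 0 and the leader
contributes the full 1.0, so the function does return 1.)

    static double
    get_parallel_divisor(Path *path)
    {
        double      parallel_divisor = path->parallel_workers;
        double      leader_contribution;

        /* The leader helps less as more workers are added. */
        leader_contribution = 1.0 - (0.3 * path->parallel_workers);
        if (leader_contribution > 0)
            parallel_divisor += leader_contribution;

        return parallel_divisor;    /* 1.0 when parallel_workers == 0 */
    }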

>
>
>> 12. cost_append() essentially adds the costs of all the subpaths and then
>> divides by the parallel divisor. This might work if all the subpaths are
>> partial paths. But for the subpaths which are not partial, a single worker
>> will incur the whole cost of that subpath. Hence just dividing the total cost
>> doesn't seem like the right thing to do. We should apply different logic for
>> costing non-partial subpaths and partial subpaths.
>
> With the current partial path costing infrastructure, it is assumed
> that a partial path node should return the average per-worker cost.
> Hence, I thought it would be best to do it in a similar way for
> Append. But let me think if we can do something; with the current
> parallelism costing infrastructure, I am not sure though.

The current parallel mechanism is in sync with that costing. Each
worker is supposed to take the same burden, hence the same (average)
cost. But that changes when a single worker has to scan an entire
child relation while different child relations have different sizes.
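
A crude sketch of what I mean, for illustration only and not a
proposal for the exact formula: partial subpaths keep contributing
their per-worker cost, but a non-partial subpath's whole cost must be
borne by the single worker that runs it, so it cannot simply be
averaged away:

    Cost        partial_share = 0;
    Cost        max_nonpartial = 0;
    ListCell   *l;

    foreach(l, subpaths)
    {
        Path       *subpath = (Path *) lfirst(l);

        if (subpath->parallel_workers > 0)
            partial_share += subpath->total_cost;   /* already per-worker */
        else
            max_nonpartial = Max(max_nonpartial, subpath->total_cost);
    }

    /* Some worker must absorb the costliest non-partial child whole. */
    append_cost = Max(partial_share, max_nonpartial);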

Thanks for working on the comments.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
