Re: Parallel Append implementation

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Append implementation
Date: 2017-02-13 13:44:09
Message-ID: CAFjFpRf6tQek6ZRzWoT-ZG2SQ534Kd8ubK6D-8GY6V3t9QfnfA@mail.gmail.com

On Mon, Feb 6, 2017 at 11:06 AM, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote:
> Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>>> We may want to think about a third goal: preventing too many workers
>>> from executing the same plan. As per the comment in
>>> get_parallel_divisor(), we do not see any benefit if more than 4
>>> workers execute the same node. So, an Append node that gets more than
>>> 4 workers could distribute them equally among the available subplans.
>>> It might be better to do that as a separate patch.
>>
>> I think that comment is about calculating the leader's contribution.
>> It does not say that 4 workers is too many in general.
>>
>> But yes, I agree, and I have it in mind as the next improvement.
>> Basically, it does not make sense to give more than 3 workers to a
>> subplan when parallel_workers for that subplan is 3. For example, if
>> the Gather max worker count is 10, and we have 2 Append subplans s1
>> and s2 with parallel_workers 3 and 5 respectively, then with the
>> current patch it will distribute 4 workers to each of these subplans.
>> What we should do instead is: once both of the subplans get 3 workers
>> each, we should give the 7th and 8th worker to s2.
>>
>> Now that I think about it, to implement the above we need to keep
>> track of per-subplan max_workers in the Append path; and with that,
>> the bitmap becomes redundant and can be replaced with max_workers.
>> Let me check if that is easy to do. We don't want to keep the bitmap
>> if we are sure it would be replaced by some other data structure.
>
> Attached is the v2 patch, which implements the above. Now the Append
> plan node stores a list of per-subplan max worker counts, rather than
> the Bitmapset. But the Bitmapset still turned out to be necessary for
> AppendPath. More details are in the subsequent comments.
>
>
>>> Goal A: Allow non-partial subpaths in Partial Append.
>>> Goal B: Distribute workers across the Append subplans.
>>> Both of these require some kind of synchronization while choosing the
>>> next subplan. So, goal B is achieved by doing all the synchronization
>>> work, and the implementation of goal A requires that goal B is
>>> implemented, so there is a dependency between these two goals. While
>>> implementing goal B, we should keep in mind that it should also work
>>> for goal A; it does not make sense to have to change the
>>> synchronization logic later for goal A.
>>>
>>> I am ok with splitting the patch into 2 patches:
>>> a) changes required for goal A
>>> b) changes required for goal B.
>>> But I think we should split it only when we are ready to commit them
>>> (commit for B, immediately followed by commit for A). Until then, we
>>> should consider both of these together because they are interconnected
>>> as explained above.
>>
>> For B, we need to know how much gain it brings and in which cases.
>> If that gain is not worth the added complexity, we may have to defer
>> Goal B. Goal A would certainly be useful since it will improve
>> performance of the targeted cases. The synchronization required for
>> Goal A is simpler than that for B, and thus if we choose to implement
>> only A, we can live with the simpler synchronization.
>
> For Goal A, the logic for a worker synchronously choosing a subplan will be:
> go to the next subplan; if that subplan has not already been assigned
> its max workers, choose it, otherwise go to the next subplan, and so
> on.

Right, at a given time we have to remember only the next plan to
assign a worker to. That's simpler than remembering the number of
workers for each subplan and updating those counts concurrently.
That's why I am saying the synchronization for A is simpler than that
for B.
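
To illustrate, the shared state for A can be as small as the index of
the next subplan plus a per-subplan "finished" flag. This is only an
illustrative sketch with made-up names, not code from the patch:

#include "postgres.h"
#include "storage/spin.h"

/*
 * Illustrative sketch only; the type and field names are made up.
 * For Goal A the shared state is just the index of the next subplan to
 * hand out, plus a flag per subplan saying it needs no more workers.
 */
typedef struct PASharedState
{
    slock_t     pa_lock;
    int         pa_next_plan;
    bool        pa_finished[FLEXIBLE_ARRAY_MEMBER];
} PASharedState;

static int
pa_choose_next_subplan(PASharedState *shared, int nplans)
{
    int         chosen = -1;
    int         i;

    SpinLockAcquire(&shared->pa_lock);
    for (i = 0; i < nplans; i++)
    {
        int         plan = (shared->pa_next_plan + i) % nplans;

        if (!shared->pa_finished[plan])
        {
            chosen = plan;
            shared->pa_next_plan = (plan + 1) % nplans;
            break;
        }
    }
    SpinLockRelease(&shared->pa_lock);

    /*
     * For a non-partial subplan the caller would also set pa_finished,
     * so that no second worker ever picks it.
     */
    return chosen;              /* -1 means nothing is left to run */
}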

> For Goal B, the logic will be:
> among the subplans which are yet to reach their max workers, choose
> the subplan with the minimum number of workers currently assigned.
>
> I don't think there is a significant difference between the complexity
> of the above two algorithms, so complexity does not look like a factor
> on which to choose one logic over the other. We should choose the
> logic which has more potential for benefit. The logic for goal B will
> work for goal A as well. And secondly, if the subplans use different
> system resources, the resource contention might be less. One case is:
> all subplans using different disks. A second case is: some of the
> subplans may be using a foreign scan, so they would start using the
> foreign server's resources sooner. These benefits apply when the
> Gather max worker count is not sufficient for running all the subplans
> at their full capacity. If it is sufficient, then the workers will be
> distributed over the subplans under both logics; just the order in
> which workers are assigned to subplans will differ.
>
> Also, I don't see a disadvantage if we follow the logic of Goal B.

Do we have any performance measurements where we see that Goal B
performs better than Goal A in such a situation? Do we have any
performance measurements comparing these two approaches in other
situations? If the implementation for Goal B always beats that of
Goal A, we can certainly implement it directly. But it may not. Also,
separating the patches for Goal A and Goal B might make reviews
easier.
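
Just to be sure we mean the same amount of bookkeeping: as I understand
it, B needs per-subplan counters updated under a common lock, roughly
like this (again only an illustrative sketch with made-up names, not
code from your patch):

#include "postgres.h"
#include "storage/spin.h"
#include <limits.h>

/*
 * Illustrative sketch only.  Goal B picks, among the subplans still
 * below their useful maximum, the one with the fewest workers currently
 * assigned, and bumps its counter.
 */
static int
pa_choose_least_loaded_subplan(slock_t *lock,
                               int *pa_num_workers,    /* shared, per subplan */
                               const int *max_workers, /* per-subplan ceiling */
                               int nplans)
{
    int         chosen = -1;
    int         fewest = INT_MAX;
    int         i;

    SpinLockAcquire(lock);
    for (i = 0; i < nplans; i++)
    {
        if (pa_num_workers[i] >= max_workers[i])
            continue;                   /* already has its max workers */
        if (pa_num_workers[i] < fewest)
        {
            fewest = pa_num_workers[i];
            chosen = i;
        }
    }
    if (chosen >= 0)
        pa_num_workers[chosen]++;       /* this worker joins that subplan */
    SpinLockRelease(lock);

    return chosen;                      /* -1: every subplan is saturated */
}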

>
>>
>> BTW, right now the patch does not consider non-partial paths for a
>> child which has partial paths. Do we know for sure that an Append path
>> using the partial path for such a child is always going to be cheaper
>> than one which uses the non-partial path? If not, should we build
>> another path which uses non-partial paths for all child relations?
>> This sounds like a 0/1 knapsack problem.
>
> I didn't quite get this. We do create a non-partial Append path using
> non-partial child paths anyway.

Let's say a given child relation has both partial and non-partial
paths; your approach would always pick the partial path. But now that
parallel append can handle non-partial paths as well, it may happen
that picking the non-partial path instead of the partial one, when
both are available, gives better overall performance. Have we ruled
out that possibility?
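
To show the kind of decision I have in mind, per child something like
this could be considered when collecting subpaths. This is only a
sketch, and note that a partial path's cost is already a per-worker
cost, so a real comparison would have to account for that:

/*
 * Sketch only: consider both flavours when a child has them, instead of
 * unconditionally preferring the partial path.
 */
Path   *cheapest_partial = childrel->partial_pathlist != NIL ?
    (Path *) linitial(childrel->partial_pathlist) : NULL;
Path   *cheapest_nonpartial = childrel->cheapest_total_path;
Path   *chosen;

if (cheapest_partial == NULL ||
    cheapest_nonpartial->total_cost < cheapest_partial->total_cost)
    chosen = cheapest_nonpartial;   /* one worker runs the whole child */
else
    chosen = cheapest_partial;      /* several workers share the child */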

>
>>
>>>
>>>
>>>> Here are some review comments
>>> I will handle the other comments, but first, just a quick response to
>>> some important ones:
>>>
>>>> 6. By looking at the parallel_workers field of a path, we can say whether
>>>> it's partial or not. We probably do not need to maintain a bitmap for that
>>>> in the Append path. The bitmap can be constructed, if required, at the time
>>>> of creating the partial append plan. The reasons to take this small step
>>>> are: 1. we want to minimize our work at the time of creating paths; 2. while
>>>> freeing a path in add_path(), we don't free the internal structures, in this
>>>> case the Bitmapset, which will waste memory if the path is not chosen during
>>>> planning.
>>>
>>> Let me try keeping the per-subplan max_worker info in the Append path
>>> itself, like I mentioned above. If that works, the bitmap will be
>>> replaced by the max_worker field. In case of a non-partial subpath,
>>> max_worker will be 1. (This is the same info kept in the AppendState
>>> node in the patch, but now we might need to keep it in the Append path
>>> node as well.)
>>
>> It would be better if we can fetch that information from each subpath
>> when creating the plan. As I have explained before, a path is a
>> minimal structure, which should be easily disposable when the path is
>> thrown away.
>
> Now in the v2 patch, we store a per-subplan worker count. But we still
> cannot use path->parallel_workers to determine whether it's a partial
> path, because even for a non-partial path the parallel_workers field
> can be non-zero. For example, create_subqueryscan_path() sets
> path->parallel_workers to subpath->parallel_workers, yet that path can
> be added as a non-partial path. So we need separate information about
> which of the subpaths in the Append path are partial subpaths. Hence,
> in the v2 patch I continued to use a Bitmapset in AppendPath, but in
> the Append plan node the number of workers is calculated using this
> bitmapset. Check the new function get_append_num_workers().

If the subpath comes from childrel->partial_pathlist, we set the
corresponding bit in the bitmap. But we can infer the same thing for
any path by checking whether that path is found in
path->parent->partial_pathlist. Since the code always chooses the
first partial path, the search in partial_pathlist should not affect
performance. So, we can avoid maintaining a bitmap in the path and
having to keep accumulating it when collapsing Append paths.
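
Something like this at plan-creation time should be enough (only a
sketch; the helper name is made up):

#include "postgres.h"
#include "nodes/relation.h"

/*
 * Sketch only.  Tell a partial Append subpath from a non-partial one at
 * plan-creation time by checking the parent rel's partial_pathlist,
 * instead of carrying a bitmap in the path.
 */
static bool
subpath_is_partial(Path *subpath)
{
    return list_member_ptr(subpath->parent->partial_pathlist, subpath);
}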

>
>>>> 7. If we consider 6, we don't need concat_append_subpaths(),
> As explained above, I have kept the Bitmapset for AppendPath.
>
>>>> but still here are
>>>> some comments about that function. Instead of accepting two separate
>>>> arguments, childpaths and child_partial_subpaths_set, which need to be
>>>> kept in sync, we can just pass the path which contains both of those.
>>>> Similarly, the following code may be optimized by adding a utility
>>>> function to Bitmapset which advances all members by a given offset,
>>>> and using that function with bms_union() to merge the bitmapsets, e.g.
>>>> bms_union(*partial_subpaths_set,
>>>> bms_advance_members(bms_copy(child_partial_subpaths_set), append_subpath_len));
>>>> if (partial_subpaths_set)
>
> I will get back on this after more thought.

Another possibility: you could use a loop like the one in
offset_relid_set(), using bms_next_member(). That way we could combine
the for loop and the bms_is_member() call into a single loop over
bms_next_member().
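
For instance, roughly like this (only a sketch; the helper name is made
up):

#include "postgres.h"
#include "nodes/bitmapset.h"

/*
 * Sketch in the style of offset_relid_set(): shift every member of
 * child_set by 'offset' and merge it into result.
 */
static Bitmapset *
add_members_with_offset(Bitmapset *result, const Bitmapset *child_set,
                        int offset)
{
    int         member = -1;

    while ((member = bms_next_member(child_set, member)) >= 0)
        result = bms_add_member(result, member + offset);

    return result;
}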

>
>>
>>>
>>>> 12. cost_append() essentially adds the costs of all the subpaths and then
>>>> divides by the parallel_divisor. This might work if all the subpaths are
>>>> partial paths, but for the subpaths which are not partial, a single worker
>>>> will incur the whole cost of that subpath. Hence just dividing the total
>>>> cost doesn't seem like the right thing to do. We should apply different
>>>> logic for costing non-partial subpaths and partial subpaths.
>>>
>>> With the current partial path costing infrastructure, it is assumed
>>> that a partial path node should return the average per-worker cost.
>>> Hence, I thought it would be best to do it in a similar way for
>>> Append. But let me think if we can do something. With the current
>>> parallelism costing infrastructure, I am not sure though.
>>
>> The current parallel mechanism is in sync with that costing. Each
>> worker is supposed to take the same burden, hence the same (average)
>> cost. But it will change when a single worker has to scan an entire
>> child relation and different child relations have different sizes.
>
> I gave this more thought. Considering that each subplan has a
> different number of workers, I think it makes sense to calculate the
> average per-worker cost even in parallel Append. In case of a
> non-partial subplan, a single worker will execute it, but it will then
> choose another subplan. So on average each worker is going to process
> the same number of rows and incur the same amount of CPU cost. And
> that CPU and row cost should be calculated by taking the total and
> dividing it by the number of workers (parallel_divisor, actually).
>

That's not entirely true. Consider N child relations whose chosen paths
have costs C1, C2, ..., CN which are very different from one another.
If there are N workers, the total cost should correspond to the highest
of the subpath costs, since no worker will execute more than one plan:
the unfortunate worker which executes the costliest path will take the
longest time, and the cost of the parallel Append should reflect that.
The patch does not make any attempt to distribute workers based on the
actual load, so such skew should be accounted for in the costing. I
don't think we can do anything about the condition I explained.
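
As a made-up illustration: with three children costing 10, 10 and 100
and three workers, averaging gives (10 + 10 + 100) / 3 = 40, but the
worker that gets the cost-100 child runs for the full 100, so the
elapsed time of the Append tracks max(Ci) rather than avg(Ci).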

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
