Re: Parallel Append implementation

From: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
To: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Append implementation
Date: 2017-02-06 05:36:03
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> We may want to think about a third goal: preventing too many workers
>> from executing the same plan. As per comment in get_parallel_divisor()
>> we do not see any benefit if more than 4 workers execute the same
>> node. So, an append node can distribute more than 4 worker nodes
>> equally among the available subplans. It might be better to do that as
>> a separate patch.
> I think that comment is for calculating leader contribution. It does
> not say that 4 workers is too many workers in general.
> But yes, I agree, and I have it in mind as the next improvement.
> Basically, it does not make sense to give more than 3 workers to a
> subplan when parallel_workers for that subplan are 3. For e.g., if
> gather max workers is 10, and we have 2 Append subplans s1 and s2 with
> parallel workers 3 and 5 respectively. Then, with the current patch,
> it will distribute 4 workers to each of these workers. What we should
> do is : once both of the subplans get 3 workers each, we should give
> the 7th and 8th worker to s2.
> Now that I think of that, I think for implementing above, we need to
> keep track of per-subplan max_workers in the Append path; and with
> that, the bitmap will be redundant. Instead, it can be replaced with
> max_workers. Let me check if it is easy to do that. We don't want to
> have the bitmap if we are sure it would be replaced by some other data
> structure.

Attached is v2 patch, which implements above. Now Append plan node
stores a list of per-subplan max worker count, rather than the
Bitmapset. But still Bitmapset turned out to be necessary for
AppendPath. More details are in the subsequent comments.

>> Goal A : Allow non-partial subpaths in Partial Append.
>> Goal B : Distribute workers across the Append subplans.
>> Both of these require some kind of synchronization while choosing the
>> next subplan. So, goal B is achieved by doing all the synchronization
>> stuff. And implementation of goal A requires that goal B is
>> implemented. So there is a dependency between these two goals. While
>> implementing goal B, we should keep in mind that it should also work
>> for goal A; it does not make sense later changing the synchronization
>> logic in goal A.
>> I am ok with splitting the patch into 2 patches :
>> a) changes required for goal A
>> b) changes required for goal B.
>> But I think we should split it only when we are ready to commit them
>> (commit for B, immediately followed by commit for A). Until then, we
>> should consider both of these together because they are interconnected
>> as explained above.
> For B, we need to know, how much gain that brings and in which cases.
> If that gain is not worth the complexity added, we may have to defer
> Goal B. Goal A would certainly be useful since it will improve
> performance of the targetted cases. The synchronization required for
> Goal A is simpler than that of B and thus if we choose to implement
> only A, we can live with a simpler synchronization.

For Goal A , the logic for a worker synchronously choosing a subplan will be :
Go the next subplan. If that subplan has not already assigned max
workers, choose this subplan, otherwise, go the next subplan, and so
For Goal B , the logic will be :
Among the subplans which are yet to achieve max workers, choose the
subplan with the minimum number of workers currently assigned.

I don't think there is a significant difference between the complexity
of the above two algorithms. So I think here the complexity does not
look like a factor based on which we can choose the particular logic.
We should choose the logic which has more potential for benefits. The
logic for goal B will work for goal A as well. And secondly, if the
subplans are using their own different system resources, the resource
contention might be less. One case is : all subplans using different
disks. Second case is : some of the subplans may be using a foreign
scan, so it would start using foreign server resources sooner. These
benefits apply when the Gather max workers count is not sufficient for
running all the subplans in their full capacity. If they are
sufficient, then the workers will be distributed over the subplans
using both the logics. Just the order of assignments of workers to
subplans will be different.

Also, I don't see a disadvantage if we follow the logic of Goal B.

> BTW, Right now, the patch does not consider non-partial paths for a
> child which has partial paths. Do we know, for sure, that a path
> containing partial paths for a child, which has it, is always going to
> be cheaper than the one which includes non-partial path. If not,
> should we build another paths which contains non-partial paths for all
> child relations. This sounds like a 0/1 knapsack problem.

I didn't quite get this. We do create a non-partial Append path using
non-partial child paths anyways.

>>> Here are some review comments
>> I will handle the other comments, but first, just a quick response to
>> some important ones :
>>> 6. By looking at parallel_worker field of a path, we can say whether it's
>>> partial or not. We probably do not require to maintain a bitmap for that at in
>>> the Append path. The bitmap can be constructed, if required, at the time of
>>> creating the partial append plan. The reason to take this small step is 1. we
>>> want to minimize our work at the time of creating paths, 2. while freeing a
>>> path in add_path, we don't free the internal structures, in this case the
>>> Bitmap, which will waste memory if the path is not chosen while planning.
>> Let me try keeping the per-subplan max_worker info in Append path
>> itself, like I mentioned above. If that works, the bitmap will be
>> replaced by max_worker field. In case of non-partial subpath,
>> max_worker will be 1. (this is the same info kept in AppendState node
>> in the patch, but now we might need to keep it in Append path node as
>> well).
> It will be better if we can fetch that information from each subpath
> when creating the plan. As I have explained before, a path is minimal
> structure, which should be easily disposable, when throwing away the
> path.

Now in the v2 patch, we store per-subplan worker count. But still, we
cannot use the path->parallel_workers to determine whether it's a
partial path. This is because even for a non-partial path, it seems
the parallel_workers can be non-zero. For e.g., in
create_subqueryscan_path(), it sets path->parallel_workers to
subpath->parallel_workers. But this path is added as a non-partial
path. So we need a separate info as to which of the subpaths in Append
path are partial subpaths. So in the v2 patch, I continued to use
Bitmapset in AppendPath. But in Append plan node, number of workers is
calculated using this bitmapset. Check the new function

>>> 7. If we consider 6, we don't need concat_append_subpaths(),
As explained above, I have kept the BitmapSet for AppendPath.

>>> but still here are
>>> some comments about that function. Instead of accepting two separate arguments
>>> childpaths and child_partial_subpaths_set, which need to be in sync, we can
>>> just pass the path which contains both of those. In the same following code may
>>> be optimized by adding a utility function to Bitmapset, which advances
>>> all members
>>> by given offset and using that function with bms_union() to merge the
>>> bitmapset e.g.
>>> bms_union(*partial_subpaths_set,
>>> bms_advance_members(bms_copy(child_partial_subpaths_set), append_subpath_len));
>>> if (partial_subpaths_set)

I will get back on this after more thought.

>>> 12. cost_append() essentially adds costs of all the subpaths and then divides
>>> by parallel_divisor. This might work if all the subpaths are partial paths. But
>>> for the subpaths which are not partial, a single worker will incur the whole
>>> cost of that subpath. Hence just dividing all the total cost doesn't seem the
>>> right thing to do. We should apply different logic for costing non-partial
>>> subpaths and partial subpaths.
>> WIth the current partial path costing infrastructure, it is assumed
>> that a partial path node should return the average per-worker cost.
>> Hence, I thought it would be best to do it in a similar way for
>> Append. But let me think if we can do something. With the current
>> parallelism costing infrastructure, I am not sure though.
> The current parallel mechanism is in sync with that costing. Each
> worker is supposed to take the same burden, hence the same (average)
> cost. But it will change when a single worker has to scan an entire
> child relation and different child relations have different sizes.

I gave more thought on this. Considering each subplan has different
number of workers, I think it makes sense to calculate average
per-worker cost even in parallel Append. In case of non-partial
subplan, a single worker will execute it, but it will next choose
another subplan. So on average each worker is going to process the
same number of rows, and also the same amount of CPU. And that amount
of CPU cost and rows cost should be calculated by taking the total
count and dividing it by number of workers (parallel_divsor actually).

> Here are some review comments
> 1. struct ParallelAppendDescData is being used at other places. The declaration
> style doesn't seem to be very common in the code or in the directory where the
> file is located.
> +struct ParallelAppendDescData
> +{
> + slock_t pa_mutex; /* mutual exclusion to choose
> next subplan */
> + parallel_append_info pa_info[FLEXIBLE_ARRAY_MEMBER];
> +};
> Defining it like
> typdef struct ParallelAppendDescData
> {
> slock_t pa_mutex; /* mutual exclusion to choose next
> subplan */
> parallel_append_info pa_info[FLEXIBLE_ARRAY_MEMBER];
> };
> will make its use handy. Instead of struct ParallelAppendDescData, you will
> need to use just ParallelAppendDescData. May be we want to rename
> parallel_append_info as ParallelAppendInfo and change the style to match other
> declarations.
> 2. The comment below refers to the constant which it describes, which looks
> odd. May be it should be worded as "A special value of
> AppendState::as_whichplan, to indicate no plans left to be executed.". Also
> using INVALID for "no plans left ..." seems to be a misnomer.
> /*
> * For Parallel Append, AppendState::as_whichplan can have PA_INVALID_PLAN
> * value, which indicates there are no plans left to be executed.
> */
> #define PA_INVALID_PLAN -1
> 3. The sentence "We have got NULL", looks odd. Probably we don't need it as
> it's clear from the code above that this code deals with the case when the
> current subplan didn't return any row.
> /*
> * We have got NULL. There might be other workers still processing the
> * last chunk of rows for this same node, but there's no point for new
> * workers to run this node, so mark this node as finished.
> */
> 4. In the same comment, I guess, the word "node" refers to "subnode" and not
> the node pointed by variable "node". May be you want to use word "subplan"
> here.
> 4. set_finished()'s prologue has different indentation compared to other
> functions in the file.
> 5. Multilevel comment starts with an empty line.
> + /* Keep track of the node with the least workers so far. For the very
Done 1. to 5. above, as per your suggestions.

> 9. Shouldn't this funciton return double?
> int
> get_parallel_divisor(int parallel_workers)

v2 patch is rebased on latest master branch, which already contains
this function returning double.

> 10. We should probably move the parallel_safe calculation out of cost_append().
> + path->parallel_safe = path->parallel_safe &&
> + subpath->parallel_safe;
> 11. This check shouldn't be part of cost_append().
> + /* All child paths must have same parameterization */
> + Assert(bms_equal(PATH_REQ_OUTER(subpath), required_outer));
Yet to handle the above ones.

Attachment Content-Type Size
ParallelAppend_v2.patch application/octet-stream 40.0 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2017-02-06 05:40:07 Re: 0/NULL/NIL assignments in build_*_rel()
Previous Message Ashutosh Bapat 2017-02-06 05:27:55 Re: 0/NULL/NIL assignments in build_*_rel()