Re: Parallel Append implementation

From: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
To: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Append implementation
Date: 2017-02-06 06:45:05
Message-ID: CAJ3gD9finGHf_nBe5zW8Euri-ZZvDACGuO9+0fb=rGR27sR2-g@mail.gmail.com
Lists: pgsql-hackers

The v2 patch was not rebased over the latest master branch commits. Please
refer to the attached ParallelAppend_v3.patch instead.

On 6 February 2017 at 11:06, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote:
> Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>>> We may want to think about a third goal: preventing too many workers
>>> from executing the same plan. As per the comment in get_parallel_divisor(),
>>> we do not see any benefit if more than 4 workers execute the same
>>> node. So, an Append node can distribute more than 4 workers
>>> equally among the available subplans. It might be better to do that as
>>> a separate patch.
>>
>> I think that comment is for calculating leader contribution. It does
>> not say that 4 workers is too many workers in general.
>>
>> But yes, I agree, and I have it in mind as the next improvement.
>> Basically, it does not make sense to give more than 3 workers to a
>> subplan when parallel_workers for that subplan is 3. For example, suppose
>> the Gather max workers is 10, and we have 2 Append subplans s1 and s2 with
>> parallel_workers 3 and 5 respectively. Then, with the current patch,
>> it will distribute 4 workers to each of these subplans. What we should
>> do is: once both of the subplans get 3 workers each, we should give
>> the 7th and 8th worker to s2.
>>
>> Now that I think about it, for implementing the above, we need to
>> keep track of per-subplan max_workers in the Append path; and with
>> that, the bitmap will be redundant. Instead, it can be replaced with
>> max_workers. Let me check if it is easy to do that. We don't want to
>> have the bitmap if we are sure it would be replaced by some other data
>> structure.
>
> Attached is the v2 patch, which implements the above. Now the Append plan
> node stores a list of per-subplan max worker counts, rather than the
> Bitmapset. But the Bitmapset still turned out to be necessary for
> AppendPath. More details are in the subsequent comments.
>
>
>>> Goal A: Allow non-partial subpaths in Partial Append.
>>> Goal B: Distribute workers across the Append subplans.
>>> Both of these require some kind of synchronization while choosing the
>>> next subplan. So, goal B is achieved by doing all the synchronization
>>> stuff. And implementation of goal A requires that goal B is
>>> implemented. So there is a dependency between these two goals. While
>>> implementing goal B, we should keep in mind that it should also work
>>> for goal A; it does not make sense to later change the synchronization
>>> logic when implementing goal A.
>>>
>>> I am ok with splitting the patch into 2 patches:
>>> a) changes required for goal A
>>> b) changes required for goal B.
>>> But I think we should split it only when we are ready to commit them
>>> (commit for B, immediately followed by commit for A). Until then, we
>>> should consider both of these together because they are interconnected
>>> as explained above.
>>
>> For B, we need to know how much gain it brings and in which cases.
>> If that gain is not worth the added complexity, we may have to defer
>> Goal B. Goal A would certainly be useful since it will improve
>> performance of the targeted cases. The synchronization required for
>> Goal A is simpler than that for B, and thus if we choose to implement
>> only A, we can live with simpler synchronization.
>
> For Goal A, the logic for a worker synchronously choosing a subplan will be:
> go to the next subplan; if that subplan has not already been assigned its
> max workers, choose it; otherwise, go to the next subplan, and so on.
> For Goal B, the logic will be:
> among the subplans which have not yet reached their max workers, choose
> the subplan with the minimum number of workers currently assigned.
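>
> To illustrate, a rough sketch of that Goal B selection step is below.
> The structure follows the ParallelAppendDescData quoted later in this
> mail, but field names such as pa_num_workers and pa_max_workers are
> assumptions made just to show the synchronization; this is not the
> patch's actual code:
>
> static int
> choose_next_subplan(struct ParallelAppendDescData *padesc, int nplans)
> {
>     int     chosen = PA_INVALID_PLAN;   /* i.e. -1 */
>     int     min_workers = INT_MAX;      /* from <limits.h> */
>     int     i;
>
>     SpinLockAcquire(&padesc->pa_mutex);
>     for (i = 0; i < nplans; i++)
>     {
>         parallel_append_info *pinfo = &padesc->pa_info[i];
>
>         /* Skip subplans that already have their maximum workers. */
>         if (pinfo->pa_num_workers >= pinfo->pa_max_workers)
>             continue;
>
>         /* Prefer the subplan with the fewest workers so far. */
>         if (pinfo->pa_num_workers < min_workers)
>         {
>             min_workers = pinfo->pa_num_workers;
>             chosen = i;
>         }
>     }
>     if (chosen != PA_INVALID_PLAN)
>         padesc->pa_info[chosen].pa_num_workers++;
>     SpinLockRelease(&padesc->pa_mutex);
>
>     return chosen;
> }
>
> For Goal A, the loop would simply stop at the first subplan that has not
> yet reached its max workers, instead of scanning all of them for the
> minimum.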
>
> I don't think there is a significant difference between the complexity
> of the above two algorithms, so complexity does not look like a factor
> on which we can base the choice of logic. We should choose the logic
> which has more potential for benefits. The logic for goal B will work
> for goal A as well. And secondly, if the subplans are using their own
> different system resources, the resource contention might be less. One
> case is: all subplans using different disks. A second case is: some of
> the subplans may be using a foreign scan, so they would start consuming
> the foreign server's resources sooner. These benefits apply when the
> Gather max workers count is not sufficient for running all the subplans
> at their full capacity. If it is sufficient, then the workers will be
> distributed over the subplans either way; just the order in which
> workers are assigned to subplans will differ.
>
> Also, I don't see a disadvantage if we follow the logic of Goal B.
>
>>
>> BTW, right now the patch does not consider non-partial paths for a
>> child which has partial paths. Do we know for sure that a path using
>> the partial paths for such children is always going to be cheaper
>> than one which includes the non-partial paths? If not, should we
>> build another path which contains non-partial paths for all child
>> relations? This sounds like a 0/1 knapsack problem.
>
> I didn't quite get this. We do create a non-partial Append path using
> non-partial child paths anyway.
>
>>
>>>
>>>
>>>> Here are some review comments
>>> I will handle the other comments, but first, just a quick response to
>>> some important ones:
>>>
>>>> 6. By looking at the parallel_workers field of a path, we can say whether it's
>>>> partial or not. We probably do not need to maintain a bitmap for that in
>>>> the Append path. The bitmap can be constructed, if required, at the time of
>>>> creating the partial Append plan. The reasons to take this small step are: 1. we
>>>> want to minimize our work at the time of creating paths, 2. while freeing a
>>>> path in add_path(), we don't free the internal structures, in this case the
>>>> Bitmapset, which will waste memory if the path is not chosen while planning.
>>>
>>> Let me try keeping the per-subplan max_worker info in the Append path
>>> itself, like I mentioned above. If that works, the bitmap will be
>>> replaced by a max_worker field. In case of a non-partial subpath,
>>> max_worker will be 1. (This is the same info kept in the AppendState node
>>> in the patch, but now we might need to keep it in the Append path node as
>>> well.)
>>
>> It will be better if we can fetch that information from each subpath
>> when creating the plan. As I have explained before, a path is a minimal
>> structure, which should be easily disposable when we throw the path
>> away.
>
> Now in the v2 patch, we store the per-subplan worker count. But we still
> cannot use path->parallel_workers to determine whether a subpath is a
> partial path. This is because even for a non-partial path, it seems
> the parallel_workers can be non-zero. For example,
> create_subqueryscan_path() sets path->parallel_workers to
> subpath->parallel_workers, but this path is added as a non-partial
> path. So we need separate information about which of the subpaths in the
> Append path are partial subpaths. So in the v2 patch, I continued to use
> the Bitmapset in AppendPath. But in the Append plan node, the number of
> workers is calculated using this Bitmapset. Check the new function
> get_append_num_workers().
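>
> The idea behind it is roughly as follows (a sketch with assumed argument
> names, not the exact patch code): partial subpaths contribute their own
> parallel_workers, non-partial subpaths contribute one worker each, and
> the total is capped by max_parallel_workers_per_gather:
>
> static int
> get_append_num_workers(List *subpaths, Bitmapset *partial_subpaths)
> {
>     ListCell   *lc;
>     int         i = 0;
>     int         num_workers = 0;
>
>     foreach(lc, subpaths)
>     {
>         Path   *subpath = (Path *) lfirst(lc);
>
>         if (bms_is_member(i, partial_subpaths))
>             num_workers += subpath->parallel_workers; /* partial subpath */
>         else
>             num_workers++;          /* non-partial: one worker runs it */
>         i++;
>     }
>
>     return Min(num_workers, max_parallel_workers_per_gather);
> }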
>
>>>> 7. If we consider 6, we don't need concat_append_subpaths(),
> As explained above, I have kept the Bitmapset in AppendPath.
>
>>>> but still here are
>>>> some comments about that function. Instead of accepting two separate arguments
>>>> childpaths and child_partial_subpaths_set, which need to be in sync, we can
>>>> just pass the path which contains both of those. In the same way, the following
>>>> code may be optimized by adding a utility function to Bitmapset which advances
>>>> all members by a given offset, and using that function with bms_union() to
>>>> merge the bitmapsets, e.g.
>>>>     bms_union(*partial_subpaths_set,
>>>>               bms_advance_members(bms_copy(child_partial_subpaths_set),
>>>>                                   append_subpath_len));
>>>> if (partial_subpaths_set)
>
> I will get back on this after more thought.
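>
> For reference, such a helper could look roughly like the sketch below.
> It follows the usage in the example above, where the caller passes in a
> copy that the function is free to consume; the name and exact behaviour
> are assumptions, not settled:
>
> /* Return a set whose members are those of 'a', each advanced by 'offset'. */
> Bitmapset *
> bms_advance_members(Bitmapset *a, int offset)
> {
>     Bitmapset  *result = NULL;
>     int         member = -1;
>
>     while ((member = bms_next_member(a, member)) >= 0)
>         result = bms_add_member(result, member + offset);
>
>     bms_free(a);        /* input was a throw-away copy, per the usage above */
>     return result;
> }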
>
>>
>>>
>>>> 12. cost_append() essentially adds costs of all the subpaths and then divides
>>>> by parallel_divisor. This might work if all the subpaths are partial paths. But
>>>> for the subpaths which are not partial, a single worker will incur the whole
>>>> cost of that subpath. Hence simply dividing the summed total cost doesn't seem the
>>>> right thing to do. We should apply different logic for costing non-partial
>>>> subpaths and partial subpaths.
>>>
>>> With the current partial path costing infrastructure, it is assumed
>>> that a partial path node should return the average per-worker cost.
>>> Hence, I thought it would be best to do it in a similar way for
>>> Append. But let me think if we can do something. With the current
>>> parallelism costing infrastructure, I am not sure though.
>>
>> The current parallel mechanism is in sync with that costing. Each
>> worker is supposed to take the same burden, hence the same (average)
>> cost. But it will change when a single worker has to scan an entire
>> child relation and different child relations have different sizes.
>
> I gave this more thought. Considering that each subplan has a different
> number of workers, I think it makes sense to calculate the average
> per-worker cost even for a parallel Append. In the case of a non-partial
> subplan, a single worker will execute it, but that worker will then choose
> another subplan. So on average, each worker is going to process the same
> number of rows and spend the same amount of CPU. And that CPU cost and
> row count should be calculated by taking the totals and dividing them by
> the number of workers (the parallel_divisor, actually).
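>
> In other words, something along these lines (a sketch with assumed names,
> not the actual cost_append() code):
>
> static Cost
> append_avg_total_cost(List *subpaths, double parallel_divisor)
> {
>     Cost        total_cost = 0;
>     ListCell   *lc;
>
>     foreach(lc, subpaths)
>         total_cost += ((Path *) lfirst(lc))->total_cost;
>
>     /* e.g. children costing 100, 200 and 300 with divisor 4 => 150 */
>     return total_cost / parallel_divisor;
> }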
>
>
>> Here are some review comments
>>
>> 1. struct ParallelAppendDescData is used in other places as well. The declaration
>> style doesn't seem to be very common in the code or in the directory where the
>> file is located.
>> +struct ParallelAppendDescData
>> +{
>> +    slock_t     pa_mutex;   /* mutual exclusion to choose next subplan */
>> +    parallel_append_info pa_info[FLEXIBLE_ARRAY_MEMBER];
>> +};
>> Defining it like
>> typedef struct ParallelAppendDescData
>> {
>>     slock_t     pa_mutex;   /* mutual exclusion to choose next subplan */
>>     parallel_append_info pa_info[FLEXIBLE_ARRAY_MEMBER];
>> } ParallelAppendDescData;
>> will make its use handy. Instead of struct ParallelAppendDescData, you will
>> need to use just ParallelAppendDescData. Maybe we want to rename
>> parallel_append_info to ParallelAppendInfo and change the style to match other
>> declarations.
>>
>> 2. The comment below refers to the constant which it describes, which looks
>> odd. Maybe it should be worded as "A special value of
>> AppendState::as_whichplan, to indicate no plans left to be executed." Also,
>> using INVALID for "no plans left ..." seems to be a misnomer.
>> /*
>> * For Parallel Append, AppendState::as_whichplan can have PA_INVALID_PLAN
>> * value, which indicates there are no plans left to be executed.
>> */
>> #define PA_INVALID_PLAN -1
>>
>> 3. The sentence "We have got NULL" looks odd. Probably we don't need it, as
>> it's clear from the code above that this code deals with the case when the
>> current subplan didn't return any row.
>> /*
>> * We have got NULL. There might be other workers still processing the
>> * last chunk of rows for this same node, but there's no point for new
>> * workers to run this node, so mark this node as finished.
>> */
>> 4. In the same comment, I guess the word "node" refers to the subnode and not
>> the node pointed to by the variable "node". Maybe you want to use the word
>> "subplan" here.
>>
>> 4. set_finished()'s prologue has different indentation compared to other
>> functions in the file.
>>
>> 5. A multi-line comment should start with an empty line:
>> + /* Keep track of the node with the least workers so far. For the very
>>
> Done 1. to 5. above, as per your suggestions.
>
>> 9. Shouldn't this function return double?
>> int
>> get_parallel_divisor(int parallel_workers)
>
> The v2 patch is rebased on the latest master branch, which already has
> this function returning double.
>
>
>> 10. We should probably move the parallel_safe calculation out of cost_append().
>> + path->parallel_safe = path->parallel_safe &&
>> + subpath->parallel_safe;
>>
>> 11. This check shouldn't be part of cost_append().
>> + /* All child paths must have same parameterization */
>> + Assert(bms_equal(PATH_REQ_OUTER(subpath), required_outer));
>>
> I have yet to handle the above ones.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment Content-Type Size
ParallelAppend_v3.patch application/octet-stream 40.2 KB
