Re: Parallel Append implementation

From: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Append implementation
Date: 2017-03-22 08:49:13
Message-ID: CAJ3gD9e-_qFOBkoQCR7OV6PjJ2uMjftEPSCHMBNPT7ZohSzCow@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Attached is the updated patch that handles the changes for all the
comments except the cost changes part. Details about the specific
changes are after the cost-related points discussed below.

>> I wanted to take into account per­subpath parallel_workers for total
>> cost of Append. Suppose the partial subpaths have per worker total
>> costs (3, 3, 3) and their parallel_workers are (2, 8, 4), with 2
>> Append workers available. So according to what you say, the total cost
>> is 9. With per­subplan parallel_workers taken into account, total cost
>> = (3*2 + 3*8 * 3*4)/2 = 21.
> But that case never happens, because the parallel workers for the
> append is always at least as large as the number of workers for any
> single child.

Yeah, that's right. I will use this approach for partial paths.

For non-partial paths, I was checking following 3 options :

Option 1. Just take the sum of total non-partial child costs and
divide it by number of workers. It seems to be getting close to the
actual cost.

Option 2. Calculate exact cost by an algorithm which I mentioned
before, which is pasted below for reference :
Per­subpath cost : 20 16 10 8 3 1, with 3 workers.
After 10 time units (this is minimum of first 3 i.e. 20, 16, 10), the
times remaining are :
10 6 0 8 3 1
After 6 units (minimum of 10, 06, 08), the times remaining are :
4 0 0 2 3 1
After 2 units (minimum of 4, 2, 3), the times remaining are :
2 0 0 0 1 1
After 1 units (minimum of 2, 1, 1), the times remaining are :
1 0 0 0 0 0
After 1 units (minimum of 1, 0 , 0), the times remaining are :
0 0 0 0 0 0
Now add up above time chunks : 10 + 6 + 2 + 1 + 1 = 20

Option 3. Get some approximation formula like you suggested. I am also
looking for such formula, just that some things are not clear to me.
The discussion of the same is below ...
>>> 2. Next, estimate the cost of the non­partial paths. To do this, make
>>> an array of Cost of that length and initialize all the elements to
>>> zero, then add the total cost of each non­partial plan in turn to the
>>> element of the array with the smallest cost, and then take the maximum
>>> of the array elements as the total cost of the non­partial plans. Add
>>> this to the result from step 1 to get the total cost.
>>
>> So with costs (8, 5, 2), add 8 and 5 to 2 so that it becomes (8, 5,
>> 15) , and so the max is 15 ? I surely am misinterpreting this.
> No. If you have costs 8, 5, and 2 and only one process, cost is 15.
> If you have two processes then for costing purposes you assume worker
> 1 will execute the first path (cost 8) and worker 2 will execute the
> other two (cost 5 + 2 = 7), so the total cost is 8. If you have three
> workers, the cost will still be 8, because there's no way to finish
> the cost­8 path in less than 8 units of work.

So the part that you suggested about adding up total cost in turn to
the smallest cost; this suggestion applies to only 1 worker right ?
For more than worker, are you suggesting to use some algorithm similar
to the one I suggested in option 2 above ? If yes, it would be great
if you again describe how that works for multiple workers. Or is it
that you were suggesting some simple approximate arithmetic that
applies to multiple workers ?
Like I mentioned, I will be happy to get such simple approximation
arithmetic that can be applied for multiple worker case. The one logic
I suggested in option 2 is something we can keep as the last option.
And option 1 is also an approximation but we would like to have a
better approximation. So wanted to clear my queries regarding option
3.

----------

Details about all the remaining changes in updated patch are below ...

On 20 March 2017 at 17:29, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Mar 17, 2017 at 1:12 PM, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote:
>>> - The substantive changes in add_paths_to_append_rel don't look right
>>> either. It's not clear why accumulate_partialappend_subpath is
>>> getting called even in the non-enable_parallelappend case. I don't
>>> think the logic for the case where we're not generating a parallel
>>> append path needs to change at all.
>>
>> When accumulate_partialappend_subpath() is called for a childrel with
>> a partial path, it works just like accumulate_append_subpath() when
>> enable_parallelappend is false. That's why, for partial child path,
>> the same function is called irrespective of parallel-append or
>> non-parallel-append case. May be mentioning this in comments should
>> suffice here ?
>
> I don't get it. If you can get the same effect by changing something
> or not changing it, presumably it'd be better to not change it. We
> try not to change things just because we can; the change should be an
> improvement in some way.
>
>>> - When parallel append is enabled, I think add_paths_to_append_rel
>>> should still consider all the same paths that it does today, plus one
>>> extra. The new path is a parallel append path where each subpath is
>>> the cheapest subpath for that childrel, whether partial or
>>> non-partial. If !enable_parallelappend, or if all of the cheapest
>>> subpaths are partial, then skip this. (If all the cheapest subpaths
>>> are non-partial, it's still potentially useful.)
>>
>> In case of all-partial childrels, the paths are *exactly* same as
>> those that would have been created for enable_parallelappend=off. The
>> extra path is there for enable_parallelappend=on only when one or more
>> of the child rels do not have partial paths. Does this make sense ?
>
> No, I don't think so. Imagine that we have three children, A, B, and
> C. The cheapest partial paths have costs of 10,000 each. A, however,
> has a non-partial path with a cost of 1,000. Even though A has a
> partial path, we still want to consider a parallel append using the
> non-partial path because it figures to be hugely faster.

Right. Now that we want to consider both cheapest partial and cheapest
non-partial path, I now get what you were saying about having an extra
path for parallel_append. I have done all of the above changes. Now we
have an extra path for enable_parallelappend=true, besides the
non-parallel partial append path.

> - You've added a GUC (which is good) but not documented it (which is
> bad) or added it to postgresql.conf.sample (also bad).

Done.

>
> - You've used a loop inside a spinlock-protected critical section,
> which is against project policy. Use an LWLock; define and document a
> new builtin tranche ID.

Done. Used LWlock for the parallel append synchronization. But I am
not sure what does "document the new builtin trancheID" mean. Didn't
find a readme which documents tranche ids.

For setting pa_finished=true when a partial plan finished, earlier it
was using Spinlock. Now it does not use any synchronization. It was
actually earlier using it because there was another field num_workers,
but it is not needed since there is no num_workers. I was considering
whether to use atomic read and write API in atomics.c for pa_finished.
But from what I understand, just a plain read/write is already atomic.
We require them only if there are some compound operations like
increment, exchange, etc.

>
> - The comment for pa_finished claims that it is the number of workers
> executing the subplan, but it's a bool, not a count; I think this
> comment is just out of date.

Done.

>
> - paths_insert_sorted_by_cost() is a hand-coded insertion sort. Can't
> we find a way to use qsort() for this instead of hand-coding a slower
> algorithm? I think we could just create an array of the right length,
> stick each path into it from add_paths_to_append_rel, and then qsort()
> the array based on <is-partial, total-cost>. Then the result can be
> turned into a list.

Now added a new function list.c list_qsort() so that it can be used in
the future.

>
> - Maybe the new helper functions in nodeAppend.c could get names
> starting with exec_append_, to match the style of
> exec_append_initialize_next().

Done.

>
> - There's a superfluous whitespace change in add_paths_to_append_rel.

Didn't find exactly which, but I guess the attached latest patch does
not have it.

>>> - In get_append_num_workers, instead of the complicated formula with
>>> log() and 0.693, just add the list lengths and call fls() on the
>>> result. Integer arithmetic FTW!
>>
>> Yeah fls() could be used. BTW I just found that costsize.c already has
>> this defined in the same way I did:
>> #define LOG2(x) (log(x) / 0.693147180559945)
>> May be we need to shift this to some common header file.
>
> LOG2() would make sense if you're working with a value represented as
> a double, but if you have an integer input, I think fls() is better.

Used fls() now.

Attachment Content-Type Size
ParallelAppend_v8.patch application/octet-stream 53.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Craig Ringer 2017-03-22 08:53:14 Re: Logical decoding on standby
Previous Message Seki, Eiji 2017-03-22 08:25:16 Re: BRIN de-summarize ranges