From: | Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> |
---|---|

To: | Robert Haas <robertmhaas(at)gmail(dot)com> |

Cc: | Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |

Subject: | Re: Parallel Append implementation |

Date: | 2017-03-22 08:49:13 |

Message-ID: | CAJ3gD9e-_qFOBkoQCR7OV6PjJ2uMjftEPSCHMBNPT7ZohSzCow@mail.gmail.com |

Views: | Raw Message | Whole Thread | Download mbox |

Thread: | |

Lists: | pgsql-hackers |

Attached is the updated patch that handles the changes for all the

comments except the cost changes part. Details about the specific

changes are after the cost-related points discussed below.

>> I wanted to take into account persubpath parallel_workers for total

>> cost of Append. Suppose the partial subpaths have per worker total

>> costs (3, 3, 3) and their parallel_workers are (2, 8, 4), with 2

>> Append workers available. So according to what you say, the total cost

>> is 9. With persubplan parallel_workers taken into account, total cost

>> = (3*2 + 3*8 * 3*4)/2 = 21.

> But that case never happens, because the parallel workers for the

> append is always at least as large as the number of workers for any

> single child.

Yeah, that's right. I will use this approach for partial paths.

For non-partial paths, I was checking following 3 options :

Option 1. Just take the sum of total non-partial child costs and

divide it by number of workers. It seems to be getting close to the

actual cost.

Option 2. Calculate exact cost by an algorithm which I mentioned

before, which is pasted below for reference :

Persubpath cost : 20 16 10 8 3 1, with 3 workers.

After 10 time units (this is minimum of first 3 i.e. 20, 16, 10), the

times remaining are :

10 6 0 8 3 1

After 6 units (minimum of 10, 06, 08), the times remaining are :

4 0 0 2 3 1

After 2 units (minimum of 4, 2, 3), the times remaining are :

2 0 0 0 1 1

After 1 units (minimum of 2, 1, 1), the times remaining are :

1 0 0 0 0 0

After 1 units (minimum of 1, 0 , 0), the times remaining are :

0 0 0 0 0 0

Now add up above time chunks : 10 + 6 + 2 + 1 + 1 = 20

Option 3. Get some approximation formula like you suggested. I am also

looking for such formula, just that some things are not clear to me.

The discussion of the same is below ...

>>> 2. Next, estimate the cost of the nonpartial paths. To do this, make

>>> an array of Cost of that length and initialize all the elements to

>>> zero, then add the total cost of each nonpartial plan in turn to the

>>> element of the array with the smallest cost, and then take the maximum

>>> of the array elements as the total cost of the nonpartial plans. Add

>>> this to the result from step 1 to get the total cost.

>>

>> So with costs (8, 5, 2), add 8 and 5 to 2 so that it becomes (8, 5,

>> 15) , and so the max is 15 ? I surely am misinterpreting this.

> No. If you have costs 8, 5, and 2 and only one process, cost is 15.

> If you have two processes then for costing purposes you assume worker

> 1 will execute the first path (cost 8) and worker 2 will execute the

> other two (cost 5 + 2 = 7), so the total cost is 8. If you have three

> workers, the cost will still be 8, because there's no way to finish

> the cost8 path in less than 8 units of work.

So the part that you suggested about adding up total cost in turn to

the smallest cost; this suggestion applies to only 1 worker right ?

For more than worker, are you suggesting to use some algorithm similar

to the one I suggested in option 2 above ? If yes, it would be great

if you again describe how that works for multiple workers. Or is it

that you were suggesting some simple approximate arithmetic that

applies to multiple workers ?

Like I mentioned, I will be happy to get such simple approximation

arithmetic that can be applied for multiple worker case. The one logic

I suggested in option 2 is something we can keep as the last option.

And option 1 is also an approximation but we would like to have a

better approximation. So wanted to clear my queries regarding option

3.

----------

Details about all the remaining changes in updated patch are below ...

On 20 March 2017 at 17:29, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Fri, Mar 17, 2017 at 1:12 PM, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote:

>>> - The substantive changes in add_paths_to_append_rel don't look right

>>> either. It's not clear why accumulate_partialappend_subpath is

>>> getting called even in the non-enable_parallelappend case. I don't

>>> think the logic for the case where we're not generating a parallel

>>> append path needs to change at all.

>>

>> When accumulate_partialappend_subpath() is called for a childrel with

>> a partial path, it works just like accumulate_append_subpath() when

>> enable_parallelappend is false. That's why, for partial child path,

>> the same function is called irrespective of parallel-append or

>> non-parallel-append case. May be mentioning this in comments should

>> suffice here ?

>

> I don't get it. If you can get the same effect by changing something

> or not changing it, presumably it'd be better to not change it. We

> try not to change things just because we can; the change should be an

> improvement in some way.

>

>>> - When parallel append is enabled, I think add_paths_to_append_rel

>>> should still consider all the same paths that it does today, plus one

>>> extra. The new path is a parallel append path where each subpath is

>>> the cheapest subpath for that childrel, whether partial or

>>> non-partial. If !enable_parallelappend, or if all of the cheapest

>>> subpaths are partial, then skip this. (If all the cheapest subpaths

>>> are non-partial, it's still potentially useful.)

>>

>> In case of all-partial childrels, the paths are *exactly* same as

>> those that would have been created for enable_parallelappend=off. The

>> extra path is there for enable_parallelappend=on only when one or more

>> of the child rels do not have partial paths. Does this make sense ?

>

> No, I don't think so. Imagine that we have three children, A, B, and

> C. The cheapest partial paths have costs of 10,000 each. A, however,

> has a non-partial path with a cost of 1,000. Even though A has a

> partial path, we still want to consider a parallel append using the

> non-partial path because it figures to be hugely faster.

Right. Now that we want to consider both cheapest partial and cheapest

non-partial path, I now get what you were saying about having an extra

path for parallel_append. I have done all of the above changes. Now we

have an extra path for enable_parallelappend=true, besides the

non-parallel partial append path.

> - You've added a GUC (which is good) but not documented it (which is

> bad) or added it to postgresql.conf.sample (also bad).

Done.

>

> - You've used a loop inside a spinlock-protected critical section,

> which is against project policy. Use an LWLock; define and document a

> new builtin tranche ID.

Done. Used LWlock for the parallel append synchronization. But I am

not sure what does "document the new builtin trancheID" mean. Didn't

find a readme which documents tranche ids.

For setting pa_finished=true when a partial plan finished, earlier it

was using Spinlock. Now it does not use any synchronization. It was

actually earlier using it because there was another field num_workers,

but it is not needed since there is no num_workers. I was considering

whether to use atomic read and write API in atomics.c for pa_finished.

But from what I understand, just a plain read/write is already atomic.

We require them only if there are some compound operations like

increment, exchange, etc.

>

> - The comment for pa_finished claims that it is the number of workers

> executing the subplan, but it's a bool, not a count; I think this

> comment is just out of date.

Done.

>

> - paths_insert_sorted_by_cost() is a hand-coded insertion sort. Can't

> we find a way to use qsort() for this instead of hand-coding a slower

> algorithm? I think we could just create an array of the right length,

> stick each path into it from add_paths_to_append_rel, and then qsort()

> the array based on <is-partial, total-cost>. Then the result can be

> turned into a list.

Now added a new function list.c list_qsort() so that it can be used in

the future.

>

> - Maybe the new helper functions in nodeAppend.c could get names

> starting with exec_append_, to match the style of

> exec_append_initialize_next().

Done.

>

> - There's a superfluous whitespace change in add_paths_to_append_rel.

Didn't find exactly which, but I guess the attached latest patch does

not have it.

>>> - In get_append_num_workers, instead of the complicated formula with

>>> log() and 0.693, just add the list lengths and call fls() on the

>>> result. Integer arithmetic FTW!

>>

>> Yeah fls() could be used. BTW I just found that costsize.c already has

>> this defined in the same way I did:

>> #define LOG2(x) (log(x) / 0.693147180559945)

>> May be we need to shift this to some common header file.

>

> LOG2() would make sense if you're working with a value represented as

> a double, but if you have an integer input, I think fls() is better.

Used fls() now.

Attachment | Content-Type | Size |
---|---|---|

ParallelAppend_v8.patch | application/octet-stream | 53.0 KB |

- Re: Parallel Append implementation at 2017-03-20 11:59:22 from Robert Haas

- Re: Parallel Append implementation at 2017-03-23 00:25:41 from Robert Haas

From | Date | Subject | |
---|---|---|---|

Next Message | Craig Ringer | 2017-03-22 08:53:14 | Re: Logical decoding on standby |

Previous Message | Seki, Eiji | 2017-03-22 08:25:16 | Re: BRIN de-summarize ranges |