Re: Parallel Append implementation

From: Andres Freund <andres(at)anarazel(dot)de>
To: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Append implementation
Date: 2017-04-07 15:05:32
Message-ID: 20170407150532.6jbaqrhdnezb5tiw@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2017-04-07 11:44:39 +0530, Amit Khandekar wrote:
> On 6 April 2017 at 07:33, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > On 2017-04-05 14:52:38 +0530, Amit Khandekar wrote:
> >> This is what the earlier versions of my patch had done: just add up
> >> per-subplan parallel_workers (1 for non-partial subplan and
> >> subpath->parallel_workers for partial subplans) and set this total as
> >> the Append parallel_workers.
> >
> > I don't think that's great; consider e.g. the case where you have one
> > very expensive query and a bunch of cheaper ones. Most of those workers
> > wouldn't do much while waiting for the expensive query. What I'm
> > basically thinking we should do is something like the following
> > pythonesque pseudocode:
> >
> > best_nonpartial_cost = -1
> > best_nonpartial_nworkers = -1
> >
> > for numworkers in 1...#max workers:
> >     worker_work = [0 for x in range(0, numworkers)]
> >
> >     nonpartial_cost += startup_cost * numworkers
> >
> >     # distribute all nonpartial tasks over workers. Assign tasks to the
> >     # worker with the least amount of work already performed.
> >     for task in all_nonpartial_subqueries:
> >         least_busy_worker = worker_work.smallest()
> >         least_busy_worker += task.total_nonpartial_cost
> >
> >     # the nonpartial cost here is the largest amount any single worker
> >     # has to perform.
> >     nonpartial_cost += worker_work.largest()
> >
> >     total_partial_cost = 0
> >     for task in all_partial_subqueries:
> >         total_partial_cost += task.total_nonpartial_cost
> >
> >     # Compute resources needed by partial tasks. First compute how much
> >     # cost we can distribute to workers that take shorter than the
> >     # "busiest" worker doing non-partial tasks.
> >     remaining_avail_work = 0
> >     for i in range(0, numworkers):
> >         remaining_avail_work += worker_work.largest() - worker_work[i]
> >
> >     # Equally divide up remaining work over all workers
> >     if remaining_avail_work < total_partial_cost:
> >         nonpartial_cost += (worker_work.largest - remaining_avail_work) / numworkers
> >
> >     # check if this is the best number of workers
> >     if best_nonpartial_cost == -1 or best_nonpartial_cost > nonpartial_cost:
> >         best_nonpartial_cost = worker_work.largest
> >         best_nonpartial_nworkers = nworkers
> >
> > Does that make sense?
>
> Yeah, I gather what you are trying to achieve is: allocate a number of
> workers such that the total cost does not exceed the cost of the first
> non-partial plan (i.e. the costliest one, because the plans are sorted
> by descending cost).
>
> So for non-partial costs such as (20, 10, 5, 2), allocate only 2
> workers, because the 2nd worker will execute (10, 5, 2) while the 1st
> worker executes (20).
>
> But for costs such as (4, 4, 4, ... 20 times), the logic would give
> us 20 workers, because we want to finish the Append in 4 time units;
> and this is what we want to avoid when we go with the
> don't-allocate-too-many-workers approach.

I guess my problem is that I don't agree with that as a goal in and of
itself. If you actually want to run your query quickly, you *want* 20
workers here. The costs of backend startup overhead (already modelled
via parallel_setup_cost), concurrency and the like should be modelled,
not buried in a formula the user can't change. If we want to make it
less and less likely to start more workers, we should make that
configurable, not the default.

I think there's some precedent taken from the parallel seqscan case
that's not actually applicable here. Parallel seqscans have a good
amount of shared state, both on the kernel and the pg side, and that
shared state reduces the gains from increasing the number of workers.
But with non-partial scans such shared state largely doesn't exist.
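
To make that concrete, here's a minimal runnable sketch of the costing
idea above. It's illustrative only, not the actual planner code: the
function name, the startup_cost default and the example cost values are
made up, and it adjusts a couple of pseudocode details (the per-worker
work list is updated in place, and it's the leftover *partial* cost that
gets spread over all workers).

def choose_num_workers(nonpartial_costs, partial_costs,
                       max_workers, startup_cost=1.0):
    best_cost = None
    best_nworkers = None

    for numworkers in range(1, max_workers + 1):
        worker_work = [0.0] * numworkers

        # Distribute the non-partial subplans over the workers, always
        # giving the next subplan (in descending cost order, as the
        # patch sorts them) to the least busy worker.
        for cost in sorted(nonpartial_costs, reverse=True):
            least_busy = worker_work.index(min(worker_work))
            worker_work[least_busy] += cost

        # The non-partial part of the runtime is bounded by the busiest
        # worker; add per-worker startup overhead.
        busiest = max(worker_work)
        total_cost = startup_cost * numworkers + busiest

        # Partial subplans can run anywhere: first soak up the slack of
        # workers that finish their non-partial work early, then divide
        # whatever is left over all workers.
        total_partial = sum(partial_costs)
        slack = sum(busiest - w for w in worker_work)
        if total_partial > slack:
            total_cost += (total_partial - slack) / numworkers

        if best_cost is None or total_cost < best_cost:
            best_cost = total_cost
            best_nworkers = numworkers

    return best_nworkers, best_cost

# Non-partial costs (20, 10, 5, 2): two workers already achieve the
# minimum, one running (20) while the other runs (10, 5, 2).
print(choose_num_workers([20, 10, 5, 2], [], max_workers=8))
# -> (2, 22.0) with these toy numbers

# Twenty non-partial subplans of cost 4: the per-worker startup cost
# keeps this from simply picking 20 workers.
print(choose_num_workers([4] * 20, [], max_workers=20))
# -> (10, 18.0) with these toy numbers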

> > especially if we get partitionwise joins.
>
> About that I am not sure, because we already have support for parallel
> joins, so wouldn't the join subpaths corresponding to all of the
> partitions be partial paths? I may be wrong about that.

We'll probably generate both, and then choose the cheaper one. The
startup cost for partitionwise joins should usually be a lot lower
(because e.g. for hash joins we'll build smaller hashtables), and we
should have a lower cost of concurrency.

> But if the subplans are foreign scans, then yes, all would be
> non-partial plans. This may provoke off-topic discussion, but here,
> instead of assigning so many workers to all these foreign plans and
> having all those workers wait for the results, a single asynchronous
> execution node (which is still in the making) would be desirable,
> because it would do the job of all these workers.

That's something that probably shouldn't be modelled through a parallel
append, I agree - it shouldn't be too hard to develop a costing model
for that, however.

Greetings,

Andres Freund
