Re: [DESIGN] ParallelAppend

From: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Subject: Re: [DESIGN] ParallelAppend
Date: 2015-10-26 01:23:46
Message-ID: 9A28C8860F777E439AA12E8AEA7694F80115DEB8@BPXM15GP.gisp.nec.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Sat, Jul 25, 2015 at 11:13 PM, Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com> wrote:
> > I'm recently working/investigating on ParallelAppend feature
> > towards the next commit fest. Below is my design proposal.
> >
> > 1. Concept
> > ----------
> > Its concept is quite simple anybody might consider more than once.
> > ParallelAppend node kicks background worker process to execute
> > child nodes in parallel / asynchronous.
> > It intends to improve the performance to scan a large partitioned
> > tables from standpoint of entire throughput, however, latency of
> > the first multi-hundred rows are not scope of this project.
> > From standpoint of technology trend, it primarily tries to utilize
> > multi-cores capability within a system, but also enables to expand
> > distributed database environment using foreign-tables inheritance
> > features.
> > Its behavior is very similar to Funnel node except for several
> > points, thus, we can reuse its infrastructure we have had long-
> > standing discussion through the v9.5 development cycle.
>
> Now that I've got more of the parallel infrastructure committed, I've
> been starting to have a little time to think about what we might want
> to do after we get PartialSeqScan committed. I'm positive on doing
> something with the Append node and parallelism, but I'm negative on
> doing what you've described here.
>
> I don't think the Append node has any business launching workers.
> That's the job of Gather. Append may need to do something
> parallel-aware, but starting workers is not that thing. Without
> making any changes at all to Append, we can use it like this:
>
> Gather
> -> Append
> -> Partial Seq Scan on p1
> -> Partial Seq Scan on p2
> -> Partial Seq Scan on p3
>
> The Gather node will spin up workers and in each worker we will ask
> the Append nodes for tuples. Append will ask the first
> not-yet-completed child for tuples, so the workers will cooperatively
> scan first p1, then p2, then p3. This is great: instead of merely
> doing a parallel seq scan of a single table, we can do a parallel seq
> scan of a partitioned table. However, there are two improvements we
> can make. First, we can teach Append that, when running in parallel,
> it should initialize a chunk of dynamic shared memory with an array
> indicating how many workers are currently working on each subplan.
> Each new worker should join a subplan with the minimum number of
> workers, work on that one until it's completely, and then pick a new
> subplan. This minimizes contention. Second, we can teach the Append
> to serve as a parent not only for intrinsically parallel nodes like
> Partial Seq Scan, but also for other nodes, say, an Index Scan. When
> an Append is running in parallel but with non-parallel-aware children,
> each such child can be claimed by at most one worker process and will
> be run to completion by that worker process. For example:
>
> Gather
> -> Append
> -> Index Scan on p1
> -> Partial Seq Scan on p2
> -> Index Scan on p3
>
> The first worker which executes the Append should begin the index scan
> on p1 and the second should begin the index scan on p3. The remaining
> workers, and those two once they're finished, can work on p2.
>
> Proceeding in this way, I don't think we need a separate Parallel
> Append node. Rather, we can just give the existing Append node some
> extra smarts that are used only when it's running in parallel.
>
I entirely agree with your suggestion.

We may be able to use an analogy between PartialSeqScan and the
parallel- aware Append node.
PartialSeqScan fetches blocks pointed by the index on shared memory
segment, thus multiple workers eventually co-operate to scan a table
using round-robin manner even though individual worker fetches comb-
shaped blocks.
If we assume individual blocks are individual sub-plans on the parallel
aware Append, it performs very similar. A certain number of workers
(more than zero) is launched by Gather node, then the parallel aware
Append node fetches one of the sub-plans if any.

I think, this design also gives additional flexibility according to
the required parallelism by the underlying sub-plans.
Please assume the "PartialSeqScan on p2" in the above example wants
3 workers to process the scan, we can expand the virtual array of
the sub-plans as follows. Then, if Gather node kicks 5 workers,
individual workers are assigned on some of plans. If Gather node
could kick less than 5 workers, the first exit worker picks the
second sub-plan, then it eventually provides the best parallelism.

+--------+
|sub-plan | * Sub-Plan 1 ... Index Scan on p1
|index on *-----> * Sub-Plan 2 ... PartialSeqScan on p2
|shared | * Sub-Plan 2 ... PartialSeqScan on p2
|memory | * Sub-Plan 2 ... PartialSeqScan on p2
+---------+ * Sub-Plan 3 ... Index Scan on p3

Here is no matter even if Append node has multiple parallel-aware
sub-plans. When "PartialSeqScan on p4" is added, all we need to do
is expand the above virtual array of the sub-plans.

For more generic plan construction, Plan node may have a field for
number of "desirable" workers even though most of plan-nodes are not
parallel aware, and it is not guaranteed.
In above case, the parallel aware Append will want 5 workers in total
(2 by 2 index-scans, plus 3 by a partial-seq-scan). It is a discretion
of Gather node how many workers are actually launched, however, it
will give clear information how many workers are best.

> First, we can teach Append that, when running in parallel,
> it should initialize a chunk of dynamic shared memory with an array
> indicating how many workers are currently working on each subplan.
>
Can the parallel-aware Append can recognize the current mode using
MyBgworkerEntry whether it is valid or not?

> We can also push other things in between the Gather and the Append,
> which wouldn't be possible in your design. For example, consider a
> join between a partitioned table p and an unpartitioned table q. We
> could do this:
>
> Gather
> -> Nested Loop
> -> Append
> -> Index Scan on p1
> -> Partial Seq Scan on p2
> -> Index Scan on p3
> -> Index Scan on q
> Index Cond q.x = p.x
>
> That's a pretty sweet plan. Assuming p1, p2, and p3 are all
> reasonably large, we could probably benefit from throwing at least 3
> processes at this plan tree - maybe more, if p2 is really big. Only
> one process can work on each of p1 and p3, but p2, since it has a
> truly parallel plan, can soak up as many as we want to throw at it
> (although performance may top out at some point if we're I/O-bound).
>
We can also leverage the parallel capability on hash-join and merge-join
(even though it is not all the cases; in case when Append node runs on
the partitioned table with CHECK() constraint).
It is the reason why my colleague works for feature of the join before
append.

http://www.postgresql.org/message-id/12A9442FBAE80D4E8953883E0B84E088606D16@BPXM01GP.gisp.nec.co.jp

> Sorry for taking so long to give a really substantive reply on this,
> but it wasn't until this last week or so that I really had time to
> think about this in detail.
>
No worry, I also couldn't make a valid progress due to various jobs
not only community works...

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kisung Kim 2015-10-26 01:52:39 questions about PG update performance
Previous Message Michael Paquier 2015-10-26 00:59:00 About BoringSSL, an OpenSSL fork