Re: [DESIGN] ParallelAppend

From: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Subject: Re: [DESIGN] ParallelAppend
Date: 2015-10-27 05:46:15
Message-ID: 9A28C8860F777E439AA12E8AEA7694F80115EABB@BPXM15GP.gisp.nec.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> -----Original Message-----
> From: pgsql-hackers-owner(at)postgresql(dot)org
> [mailto:pgsql-hackers-owner(at)postgresql(dot)org] On Behalf Of Robert Haas
> Sent: Monday, October 26, 2015 8:53 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: pgsql-hackers(at)postgresql(dot)org; Amit Kapila; Kyotaro HORIGUCHI
> Subject: Re: [HACKERS] [DESIGN] ParallelAppend
>
> On Sun, Oct 25, 2015 at 9:23 PM, Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com> wrote:
> > I entirely agree with your suggestion.
> >
> > We may be able to use an analogy between PartialSeqScan and the
> > parallel- aware Append node.
> > PartialSeqScan fetches blocks pointed by the index on shared memory
> > segment, thus multiple workers eventually co-operate to scan a table
> > using round-robin manner even though individual worker fetches comb-
> > shaped blocks.
> > If we assume individual blocks are individual sub-plans on the parallel
> > aware Append, it performs very similar. A certain number of workers
> > (more than zero) is launched by Gather node, then the parallel aware
> > Append node fetches one of the sub-plans if any.
>
> Exactly, except for the part about the blocks being "comb-shaped",
> which doesn't seem to make sense.
>
> > I think, this design also gives additional flexibility according to
> > the required parallelism by the underlying sub-plans.
> > Please assume the "PartialSeqScan on p2" in the above example wants
> > 3 workers to process the scan, we can expand the virtual array of
> > the sub-plans as follows. Then, if Gather node kicks 5 workers,
> > individual workers are assigned on some of plans. If Gather node
> > could kick less than 5 workers, the first exit worker picks the
> > second sub-plan, then it eventually provides the best parallelism.
> >
> > +--------+
> > |sub-plan | * Sub-Plan 1 ... Index Scan on p1
> > |index on *-----> * Sub-Plan 2 ... PartialSeqScan on p2
> > |shared | * Sub-Plan 2 ... PartialSeqScan on p2
> > |memory | * Sub-Plan 2 ... PartialSeqScan on p2
> > +---------+ * Sub-Plan 3 ... Index Scan on p3
>
> I don't think the shared memory chunk should be indexed by worker, but
> by sub-plan. So with 3 subplans, we would initially have [0,0,0].
> The first worker would grab the first subplan, and we get [1,0,0].
> The second worker grabs the third subplan, so now we have [1,0,1].
> The remaining workers can't join the execution of those plans because
> they are not truly parallel, so they all gang up on the second
> subplan. At 5 workers we have [1,3,1]. Workers need not ever
> decrement the array entries because they only pick a new sub-plan when
> the one they picked previously is exhausted; thus, at the end of the
> plan, each element in the array shows the total number of workers that
> touched it at some point during its execution.
>
Sorry, I could not get the point in the above explanation.
The 1st worker would grab the first subplan, then [1,0,0]. It's OK.
The 2nd worker would grab the last subplan, then [1,0,1]. It's
understandable, even though I'm uncertain why it does not pick up
the 2nd one.
Why remaining worker have to gang up to kick the 2nd (PartialSeqScan;
that is parallel-aware execution node)?

Even if only one worker is kicked towards the PartialSeqScan, it tries
to scan the relation sequentially (because of no parallel workers actually).
Then, once other worker gets additionally launched to scan same relation,
both of the worker begins to co-operate using a common block-index kept
on the shared memory.
So, do we need to wait for completion of non-parallel aware nodes here?

I assume that it is better to launch PartialSeqScan, even if one worker,
than synchronization, because other worker can join the execution later.

> > For more generic plan construction, Plan node may have a field for
> > number of "desirable" workers even though most of plan-nodes are not
> > parallel aware, and it is not guaranteed.
> > In above case, the parallel aware Append will want 5 workers in total
> > (2 by 2 index-scans, plus 3 by a partial-seq-scan). It is a discretion
> > of Gather node how many workers are actually launched, however, it
> > will give clear information how many workers are best.
>
> Yeah, maybe. I haven't thought about this deeply just yet, but I
> agree it needs more consideration.
>
OK, I'll build a patch including the concept.

> >> First, we can teach Append that, when running in parallel,
> >> it should initialize a chunk of dynamic shared memory with an array
> >> indicating how many workers are currently working on each subplan.
> > Can the parallel-aware Append can recognize the current mode using
> > MyBgworkerEntry whether it is valid or not?
>
> No - that would be quite wrong. What it needs to do is define
> ExecAppendEstimate and ExecAppendInitializeDSM and call those
> functions from ExecParallelEstimate and ExecParallelInitializeDSM. It
> also needs to define a third callback ExecAppendInitializeWorker which
> will be called from the ExecParallelInitializeWorker function added by
> the latest partial seq scan patch. ExecAppendEstimate must estimate
> required shared memory usage for the shared memory state;
> ExecAppendInitializeDSM must initialize that state, store a pointer to
> it in the planstate note, and add a TOC entry; ExecAppendWorker will
> run in the worker and should look up the TOC entry and store the
> result in the same planstate node that ExecAppendInitializeDSM
> populated in the leader. Then ExecAppend can decide what to do based
> on whether that pointer is set, and based on the data to which it
> points.
>
Thanks for the clear direction.

> Are you going to look at implementing this?
>
I feel the scale of implementation is not large, if Append node itself
is not capable to launch a new worker process. Let me try it.

Best regards,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2015-10-27 06:59:00 Re: TODO list updates
Previous Message Rajeev rastogi 2015-10-27 05:29:18 Re: Dangling Client Backend Process