Re: Parallel Seq Scan

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>, Amit Langote <amitlangote09(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Fabrízio Mello <fabriziomello(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Seq Scan
Date: 2015-04-22 20:56:03
Message-ID: CA+TgmoZDXEr7ke4iYFX+ex+H4HjO48PV7PCCYAQ1o70EH=HyRg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Apr 22, 2015 at 8:48 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> I have implemented this idea (note that I have to expose a new API
> shm_mq_from_handle as TupleQueueFunnel stores shm_mq_handle* and
> we sum_mq* to call shm_mq_detach) and apart this I have fixed other
> problems reported on this thread:
>
> 1. Execution of initPlan by master backend and then pass the
> required PARAM_EXEC parameter values to workers.
> 2. Avoid consuming dsm's by freeing the parallel context after
> the last tuple is fetched.
> 3. Allow execution of Result node in worker backend as that can
> be added as a gating filter on top of PartialSeqScan.
> 4. Merged parallel heap scan descriptor patch
>
> To apply the patch, please follow below sequence:
>
> HEAD Commit-Id: 4d930eee
> parallel-mode-v9.patch [1]
> assess-parallel-safety-v4.patch [2] (don't forget to run fixpgproc.pl in
> the patch)
> parallel_seqscan_v14.patch (Attached with this mail)

Thanks, this version looks like an improvement. However, I still see
some problems:

- I believe the separation of concerns between ExecFunnel() and
ExecEndFunnel() is not quite right. If the scan is shut down before
it runs to completion (e.g. because of LIMIT), then I think we'll call
ExecEndFunnel() before ExecFunnel() hits the TupIsNull(slot) path. I
think you probably need to create a static subroutine that is called
both as soon as TupIsNull(slot) and also from ExecEndFunnel(), in each
case cleaning up whatever resources remain.

- InitializeParallelWorkers() still mixes together general parallel
executor concerns with concerns specific to parallel sequential scan
(e.g. EstimatePartialSeqScanSpace). We have to eliminate everything
that assumes that what's under a funnel will be, specifically, a
partial sequential scan. To make this work properly, I think we should
introduce a new function that recurses over the plan tree and invokes
some callback for each plan node. I think this could be modeled on
this code from ExplainNode(), beginning around line 1593:

/* initPlan-s */
if (planstate->initPlan)
ExplainSubPlans(planstate->initPlan, ancestors, "InitPlan", es);

/* lefttree */
if (outerPlanState(planstate))
ExplainNode(outerPlanState(planstate), ancestors,
"Outer", NULL, es);

/* righttree */
if (innerPlanState(planstate))
ExplainNode(innerPlanState(planstate), ancestors,
"Inner", NULL, es);

/* special child plans */
switch (nodeTag(plan))
{
/* a bunch of special cases */
}

/* subPlan-s */
if (planstate->subPlan)
ExplainSubPlans(planstate->subPlan, ancestors, "SubPlan", es);

The new function would do the same sort of thing, but instead of
explaining each node, it would invoke a callback for each node.
Possibly explain.c could use it instead of having hard-coded logic.
Possibly it should use the same sort of return-true convention as
expression_tree_walker, query_tree_walker, and friends. So let's call
it planstate_tree_walker.

Now, instead of directly invoking logic specific to parallel
sequential scan, it should call planstate_tree_walker() on its
lefttree and pass a new function ExecParallelEstimate() as the
callback. That function ignores any node that's not parallel aware,
but when it sees a partial sequential scan (or, in the future, some a
parallel bitmap scan, parallel sort, or what have you) it does the
appropriate estimation work. When ExecParallelEstimate() finishes, we
InitializeParallelDSM(). Then, we call planstate_tree_walker() on the
lefttree again, and this time we pass another new function
ExecParallelInitializeDSM(). Like the previous one, that ignores the
callbacks from non-parallel nodes, but if it hits a parallel node,
then it fills in the parallel bits (i.e. ParallelHeapScanDesc for a
partial sequential scan).

- shm_mq_from_handle() is probably reasonable, but can we rename it
shm_mq_get_queue()?

- It's hard to believe this is right:

+ if (parallelstmt->inst_options)
+ receiver = None_Receiver;

Really? Flush the tuples if there are *any instrumentation options
whatsoever*? At the very least, that doesn't look too future-proof,
but I'm suspicious that it's outright incorrect.

- I think ParallelStmt probably shouldn't be defined in parsenodes.h.
That file is included in a lot of places, and adding all of those
extra #includes there doesn't seem like a good idea for modularity
reasons even if you don't care about partial rebuilds. Something that
includes a shm_mq obviously isn't a "parse" node in any meaningful
sense anyway.

- I don't think you need both setup cost and startup cost. Starting
up more workers isn't particularly more expensive than starting up
fewer of them, because most of the overhead is in waiting for them to
actually start, and the number of workers is reasonable, then they're
all be doing that in parallel with each other. I suggest removing
parallel_startup_cost and keeping parallel_setup_cost.

- In cost_funnel(), I don't think it's right to divide the run cost by
nWorkers + 1. Suppose we've got a plan that looks like this:

Funnel
-> Hash Join
-> Partial Seq Scan on a
-> Hash
-> Seq Scan on b

The sequential scan on b is going to get executed once per worker,
whereas the effort for the sequential scan on a is going to be divided
over all the workers. So the right way to cost this is as follows:

(a) The cost of the partial sequential scan on a is equal to the cost
of a regular sequential scan, plus a little bit of overhead to account
for communication via the ParallelHeapScanDesc, divided by the number
of workers + 1.
(b) The cost of the remaining nodes under the funnel works normally.
(c) The cost of the funnel is equal to the cost of the hash join plus
number of tuples multiplied by per-tuple communication overhead plus a
large fixed overhead reflecting the time it takes the workers to
start.

- While create_parallelscan_paths() is quite right to limit the number
of workers to no more than the number of pages, it's pretty obvious
that in practice that's way too conservative. I suggest we get
significantly more aggressive about that, like limiting ourselves to
one worker per thousand pages. We don't really know exactly what the
costing factors should be here just yet, but we certainly know that
spinning up lots of workers to read a handful of pages each must be
dumb. And we can save a significant amount of planning time here by
not bothering to generate parallel paths for little tiny relations.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2015-04-22 20:58:31 Re: Streaming replication and WAL archive interactions
Previous Message Bruce Momjian 2015-04-22 20:55:25 Re: Turning off HOT/Cleanup sometimes