Re: parallelize queries containing initplans

From: Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallelize queries containing initplans
Date: 2017-03-15 21:04:01
Message-ID: CAGz5QC+uHOq78GCika3fbgRyN5zgiDR9Dd1Th5kENF+UpnPomQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Mar 14, 2017 at 3:20 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> Based on that idea, I have modified the patch such that it will
> compute the set of initplans Params that are required below gather
> node and store them as bitmap of initplan params at gather node.
> During set_plan_references, we can find the intersection of external
> parameters that are required at Gather or nodes below it with the
> initplans that are passed from same or above query level. Once the set
> of initplan params are established, we evaluate those (if they are not
> already evaluated) before execution of gather node and then pass the
> computed value to each of the workers. To identify whether a
> particular param is parallel safe or not, we check if the paramid of
> the param exists in initplans at same or above query level. We don't
> allow to generate gather path if there are initplans at some query
> level below the current query level as those plans could be
> parallel-unsafe or undirect correlated plans.

I would like to mention different test scenarios with InitPlans that
we've considered while developing and testing of the patch.

An InitPlan is a subselect that doesn't take any reference from its
immediate outer query level and it returns a param value. For example,
consider the following query:

QUERY PLAN
------------------------------
Seq Scan on t1
Filter: (k = $0)
allParams: $0
InitPlan 1 (returns $0)
-> Aggregate
-> Seq Scan on t3
In this case, the InitPlan is evaluated once when the filter is
checked for the first time. For subsequent checks, we need not
evaluate the initplan again since we already have the value. In our
approach, we parallelize the sequential scan by inserting a Gather
node on top of parallel sequential scan node. At the Gather node, we
evaluate the InitPlan before spawning the workers and pass this value
to the worker using dynamic shared memory. This yields the following
plan:
QUERY PLAN
---------------------------------------------------
Gather
Workers Planned: 2
Params Evaluated: $0
InitPlan 1 (returns $0)
-> Aggregate
-> Seq Scan on t3
-> Parallel Seq Scan on t1
Filter: (k = $0)
As Amit mentioned up in the thread, at a Gather node, we evaluate only
those InitPlans that are attached to this query level or any higher
one and are used under the Gather node. extParam at a node includes
the InitPlan params that should be passed from an outer node. I've
attached a patch to show extParams and allParams for each node. Here
is the output with that patch:
QUERY PLAN
---------------------------------------------------
Gather
Workers Planned: 2
Params Evaluated: $0
allParams: $0
InitPlan 1 (returns $0)
-> Finalize Aggregate
-> Gather
Workers Planned: 2
-> Partial Aggregate
-> Parallel Seq Scan on t3
-> Parallel Seq Scan on t1
Filter: (k = $0)
allParams: $0
extParams: $0
In this case, $0 is included in extParam of parallel sequential scan
and the InitPlan corresponding to this param is attached to the same
query level that contains the Gather node. Hence, we evaluate $0 at
Gather and pass it to workers.

But, for generating a plan like this requires marking an InitPlan
param as parallel_safe. We can't mark all params as parallel_safe
because of correlated subselects. Hence, in
max_parallel_hazard_walker, the only params marked safe are InitPlan
params from current or outer query level. An InitPlan param from inner
query level isn't marked safe since we can't evaluate this param at
any Gather node above the current node(where the param is used). As
mentioned by Amit, we also don't allow generation of gather path if
there are InitPlans at some query level below the current query level
as those plans could be parallel-unsafe or undirect correlated plans.

I've attached a script file and its output containing several
scenarios relevant to InitPlans. I've also attached the patch for
displaying extParam and allParam at each node. This patch can be
applied on top of pq_pushdown_initplan_v3.patch. Please find the
attachments.

> This restricts some of the cases for parallelism like when initplans
> are below gather node, but the patch looks better. We can open up
> those cases if required in a separate patch.
+1. Unfortunately, this patch doesn't enable parallelism for all
possible cases with InitPlans. Our objective is to keep things simple
and clean. Still, TPC-H q22 runs 2.5~3 times faster with this patch.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
script.out application/octet-stream 19.0 KB
script.sql application/sql 2.4 KB
0002-Show-extParams-and-allParams-in-explain.patch binary/octet-stream 2.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2017-03-15 21:33:46 Re: WIP: Faster Expression Processing v4
Previous Message Andres Freund 2017-03-15 20:57:32 Re: WIP: Faster Expression Processing v4