parallelize queries containing subplans

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: parallelize queries containing subplans
Date: 2016-12-28 06:17:59
Message-ID: CAA4eK1+e8Z45D2n+rnDMDYsVEb5iW7jqaCH_tvPMYau=1Rru9w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Currently, queries that have references to SubPlans or
AlternativeSubPlans are considered parallel-restricted. I think we
can lift this restriction in many cases especially when SubPlans are
parallel-safe. To make this work, we need to propagate the
parallel-safety information from path node to plan node and the same
could be easily done while creating a plan. Another option could be
that instead of propagating parallel-safety information from path to
plan, we can find out from the plan if it is parallel-safe (doesn't
contain any parallel-aware node) by traversing whole plan tree, but I
think it is a waste of cycles. Once we have parallel-safety
information in the plan, we can use that for detection of
parallel-safe expressions in max_parallel_hazard_walker(). Finally,
we can pass all the subplans to workers during plan serialization in
ExecSerializePlan(). This will enable workers to execute subplans
that are referred in parallel part of the plan. Now, we might be able
to optimize it such that we pass only subplans that are referred in
parallel portion of plan, but I am not sure if it is worth the trouble
because it is one-time cost and much lesser than other things we do
(like creating
dsm, launching workers).

Attached patch implements the above idea. This will enable
parallelism for queries containing un-correlated subplans, an example
of which is as follows:

set parallel_tuple_cost=0;
set parallel_setup_cost=0;
set min_parallel_relation_size=50;

create table t1 (i int, j int, k int);
create table t2 (i int, j int, k int);

insert into t1 values (generate_series(1,10)*random(),
generate_series(5,50)*random(),
generate_series(8,80)*random());
insert into t2 values (generate_series(4,10)*random(),
generate_series(5,90)*random(),
generate_series(2,10)*random());

Plan without Patch
-----------------------------
postgres=# explain select * from t1 where t1.i not in (select t2.i
from t2 where t2.i in (1,2,3));
QUERY PLAN
---------------------------------------------------------------
Seq Scan on t1 (cost=110.84..411.72 rows=8395 width=12)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> Seq Scan on t2 (cost=0.00..104.50 rows=2537 width=4)
Filter: (i = ANY ('{1,2,3}'::integer[]))
(5 rows)

Plan with Patch
------------------------
postgres=# explain select * from t1 where t1.i not in (select t2.i
from t2 where t2.i in (1,2,3));
QUERY PLAN
-------------------------------------------------------------------------
Gather (cost=110.84..325.30 rows=8395 width=12)
Workers Planned: 1
-> Parallel Seq Scan on t1 (cost=110.84..325.30 rows=4938 width=12)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> Seq Scan on t2 (cost=0.00..104.50 rows=2537 width=4)
Filter: (i = ANY ('{1,2,3}'::integer[]))
(7 rows)

We have observed that Q-16 in TPC-H have been improved with the patch
and the analysis of same will be shared by my colleague Rafia.

Now, we can further extend this to parallelize queries containing
correlated subplans like below:

explain select * from t1 where t1.i in (select t2.i from t2 where t2.i=t1.i);
QUERY PLAN
-------------------------------------------------------------
Seq Scan on t1 (cost=0.00..831049.09 rows=8395 width=12)
Filter: (SubPlan 1)
SubPlan 1
-> Seq Scan on t2 (cost=0.00..97.73 rows=493 width=4)
Filter: (i = t1.i)
(5 rows)

In the above query, Filter on t2 (i = t1.i) generates Param node which
is a parallel-restricted node, so such queries won't be able to use
parallelism even with the patch. I think we can mark Params which
refer to same level as parallel-safe and I think we have this
information (node-> varlevelsup/ phlevelsup/ agglevelsup) available
when we replace correlation vars (SS_replace_correlation_vars). The
reason why it is not advisable to mark Params that don't refer to same
query level as parallel-safe is that can lead to plans like below:

Foo
-> Gather
-> Bar
SubPlan 1 (SubPlan refers to Foo)

Now, this problem is tricky because we need to pass all such Params
each time we invoke Gather. That also is doable with much more
effort, but I am not sure if it is worth because all of the use cases
I have seen in TPC-H (Q-2) or TPC-DS (Q-6) always uses SubPlans that
refer to same query level.

Yet, another useful enhancement in this area could be to consider both
parallel and non-parallel paths for subplans. As of now, we consider
the cheapest/best path and form subplan from it, but it is quite
possible that instead of choosing parallel path (in case it is
cheapest) at subplan level, the non-parallel path at subplan level
could be beneficial when upper plan can use parallelism. I think this
will be a separate project in itself if we want to do this and based
on my study of TPC-H and TPC-DS queries, I am confident that this will
be helpful in certain queries at higher scale factors.

Thoughts?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
pq_pushdown_subplan_v1.patch application/octet-stream 8.1 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Bapat 2016-12-28 06:38:31 Re: ALTER TABLE parent SET WITHOUT OIDS and the oid column
Previous Message Michael Paquier 2016-12-28 06:07:52 Re: Hooks