Re: [Proposal] Table partition + join pushdown

From: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>
To: Taiki Kondo <tai-kondo(at)yk(dot)jp(dot)nec(dot)com>
Cc: Akio Iwaasa <aki-iwaasa(at)vt(dot)jp(dot)nec(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [Proposal] Table partition + join pushdown
Date: 2015-09-29 02:45:47
Message-ID: 9A28C8860F777E439AA12E8AEA7694F80114C12A@BPXM15GP.gisp.nec.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> -----Original Message-----
> From: pgsql-hackers-owner(at)postgresql(dot)org
> [mailto:pgsql-hackers-owner(at)postgresql(dot)org] On Behalf Of Taiki Kondo
> Sent: Thursday, September 24, 2015 8:06 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Iwaasa Akio(岩浅 晃郎); pgsql-hackers(at)postgresql(dot)org
> Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown
>
> Hello, KaiGai-san.
>
> Thank you for your comment, and sorry for late response.
>
> The attached patch is completely rewritten from previous patch[1], at your
> suggestion[2].
> Please find attached.
>
Thanks for your work, and let me introduce purpose of the work briefly,
because the last submission was August.

His work intends (1) to save resource consumption on tables join at this
moment, and (2) to provide an infrastructure of one parallel join scenario
once Funnel node gets capable.

Even if we construct partition tables, it is unavailable to utilize to
filter out candidate rows of join. In the result, size of Hash table
may grow more than necessity and it causes unnecessary nBatch increase.

Below is the scenario this project tries to tackle. In case when tables
join takes partitioned table on one side, usually, we once need to run
entire partitioned table unless we cannot drop particular child tables.

XXXXJoin cond (x = y)
-> Append
-> SeqScan on tbl_child_0 ... CHECK (hash_func(x) % 4 = 0)
-> SeqScan on tbl_child_1 ... CHECK (hash_func(x) % 4 = 1)
-> SeqScan on tbl_child_2 ... CHECK (hash_func(x) % 4 = 2)
-> SeqScan on tbl_child_3 ... CHECK (hash_func(x) % 4 = 3)
-> Hash
-> SeqScan on other_table

However, CHECK() constraint assigned on child tables give us hint which
rows in other side are never related to this join.
For example, all the rows in other_table to be joined with tbl_child_0
should have multiple number of 4 on hash_func(y). We may be able to omit
unrelated rows from the hash-table in this case, then it eventually allows
to reduce the size of hash table.

In case of INNER_JOIN, we can rewrite the query execution plan as below.

Append
-> HashJoin cond (x = y)
-> SeqScan on tbl_child_0
-> Hash ... Filter: hash_func(y) % 4 = 0
-> SeqScan on other_table
-> HashJoin cond (x = y)
-> SeqScan on tbl_child_1
-> Hash ... Filter: hash_func(y) % 4 = 1
-> SeqScan on other_table
-> HashJoin cond (x = y)
-> SeqScan on tbl_child_2
-> Hash ... Filter: hash_func(y) % 4 = 2
-> SeqScan on other_table
-> HashJoin cond (x = y)
-> SeqScan on tbl_child_3
-> Hash ... Filter: hash_func(y) % 4 = 3
-> SeqScan on other_table

Unrelated rows of Hash table is preliminarily, it allows to avoid hash
table split when its size reaches to work_mem limitation.

This join-pushdown is valuable on hash-join and merge-join if MJ takes
unsorted relation and number of rows to be sorted is performance factor.
Also, once Funnel gets capable to run Append on background worker, it
is also helpful to run NestLoop in parallel.

How about the opinion from third parties? I'm a bit biased, of course.

OK, below is the brief comment to patch.

* Suppose we focus on only HashJoin in the first version?
This patch also add support on NestLoop and MergeJoin, however, NestLoop
has no valuable scenario without parallel execution capability, and the
most valuable scenario on MergeJoin is reduction of rows prior to Sort.
Once input rows gets sorted, it is less attractive to filter out rows.

* MultiExecHash() once put slot on outer_slot then move it to inner_slot
This patch add set_hash_references() to replace varno in the expression
of Hash->filterqual to OUTER_VAR. Why not INNER_VAR?
If Var nodes would be initialized t oreference inner_slot, you don't need
to re-assign slot.

I'll try to have deeper investigation, later.

> This patch contains following implementation, but I can't determine this is
> correct or wrong.
>
> 1. Cost estimation
> In this patch, additional row filter is implemented for Hash, Merge join and Nested
> Loop.
> I implemented cost estimation feature for this filter by watching other parts
> of filters,
> but I am not sure this implementation is correct.
>

@@ -2835,6 +2864,8 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
* not all of the quals may get evaluated at each tuple.)
*/
startup_cost += qp_qual_cost.startup;
+ startup_cost += filter_qual_cost.startup +
+ filter_qual_cost.per_tuple * inner_path_rows;
cpu_per_tuple = cpu_tuple_cost + qp_qual_cost.per_tuple;
run_cost += cpu_per_tuple * hashjointuples;

It seems to me it is not a fair estimation because inner_path_rows means
number of rows already filtered out, but filter_qual shall be applied to
all the inner input rows.

> 2. Workaround for failing assertion at allpaths.c
> In standard_join_search(), we expect to have a single rel at the final level.
> But this expectation is disappointed by join pushdown feature, because this will
> search for the combinations not searched by original standard_join_serch()
> at the final level. Therefore, once trying join pushdown is succeeded,
> failing assertion occurs in allpaths.c.
>
> So I implemented workaround by temporary set NULL to root->join_rel_level while
> trying join pushdown, but I think this implementation may be wrong.
>
It is my off-list suggestion. The standard_join_search expects root of
the partition tables will appear, but child tables are out of scope.
Once we try to push down the join under the append, we need to consider
table joins between inner table and every outer child tables, however,
it should not be visible to standard_join_search context.
From the standpoint of standard_join_search, it get an AppendPath that
represents a table join A and B, even if A contains 100 children and
join was pushed down on behalf of the AppendPath.
So, it is a reasonable way to set NULL on root->join_rel_level to avoid
unexpected RelOptInfo addition by build_join_rel().
"to avoid assertion" is one fact, however, intension of the code is
avoid pollution of the global data structure. ;-)

> 3. Searching pathkeys for Merge Join
> When join pushdown feature chooses merge join for pushed-down join operation,
> planner fails to create merge join node because it is unable to find pathkeys
> for this merge join. I found this is caused by skipping child table in finding
> pathkeys.
>
> I expect that it is for making planner faster, and I implemented that
> planner doesn't skip child table in finding pathkeys for merge join.
> But I am not sure this expectation is correct.
>
I like to recommend to omit MergeJoin support at the first version.

Thanks,

> Any comments/suggestions are welcome.
>
>
> Remarks :
> [1]
> http://www.postgresql.org/message-id/12A9442FBAE80D4E8953883E0B84E0885C01FD@
> BPXM01GP.gisp.nec.co.jp
> [2]
> http://www.postgresql.org/message-id/9A28C8860F777E439AA12E8AEA7694F8011345B
> 6(at)BPXM15GP(dot)gisp(dot)nec(dot)co(dot)jp
>
> Best regards,
> --
> Taiki Kondo
>
> NEC Solution Innovators, Ltd.
>
> -----Original Message-----
> From: Kaigai Kouhei(海外 浩平) [mailto:kaigai(at)ak(dot)jp(dot)nec(dot)com]
> Sent: Tuesday, August 18, 2015 5:47 PM
> To: Kondo Taiki(近藤 太樹); pgsql-hackers(at)postgresql(dot)org
> Cc: Iwaasa Akio(岩浅 晃郎)
> Subject: RE: [Proposal] Table partition + join pushdown
>
> Hello Kondo-san,
>
> I briefly checked your patch. Let me put some comments about its design and
> implementation, even though I have no arguments towards its concept. :-)
>
> * Construction of RelOptInfo
>
> In your patch, try_hashjoin_pushdown() called by try_hashjoin_path() constructs
> RelOptInfo of the join-rel between inner-rel and a subpath of Append node. It
> is entirely wrong implementation.
>
> I can understand we (may) have no RelOptInfo for the joinrel between
> tbl_child_0 and other_table, when planner investigates a joinpath to process join
> Append path with the other_table.
>
> > HashJoin
> > -> Append
> > -> SeqScan on tbl_child_0
> > -> SeqScan on tbl_child_1
> > -> SeqScan on tbl_child_2
> > -> SeqScan on tbl_child_3
> > -> Hash
> > -> SeqScan on other_table
> >
> How about these alternatives?
>
> - calls make_join_rel() to the pair of tbl_child_X and other_table
> by try_hashjoin_pushdown() or somewhere. make_join_rel() internally
> construct a RelOptInfo for the supplied pair of relations, so
> relevant RelOptInfo shall be properly constructed.
> - make_join_rel() also calls add_paths_to_joinrel() towards all the
> join logic, so it makes easier to support to push down other join
> logic including nested-loop or custom-join.
> - It may be an idea to add an extra argument to make_join_rel() to
> inform expressions to be applied for tuple filtering on
> construction of inner hash table.
>
> * Why only SeqScan is supported
>
> I think it is the role of Hash-node to filter out inner tuples obviously unrelated
> to the join (if CHECK constraint of outer relation gives information), because
> this join-pushdown may be able to support multi-stacked pushdown.
>
> For example, if planner considers a path to join this Append-path with another
> relation, and join clause contains a reference to X?
>
> > Append
> > -> HashJoin
> > -> SeqScan on tbl_child_0
> > -> Hash ... Filter: hash_func(X) % 4 = 0
> > -> SeqScan on other_table
> > -> HashJoin
> > -> SeqScan on tbl_child_1
> > -> Hash ... Filter: hash_func(X) % 4 = 1
> > -> SeqScan on other_table
> > -> HashJoin
> > -> SeqScan on tbl_child_2
> > -> Hash ... Filter: hash_func(X) % 4 = 2
> > -> SeqScan on other_table
> > -> HashJoin
> > -> SeqScan on tbl_child_3
> > -> Hash ... Filter: hash_func(X) % 4 = 3
> > -> SeqScan on other_table
> >
> It may be a good challenge to consider additional join pushdown, even if subpaths
> of Append are HashJoin, not SeqScan, like:
>
> Append
> -> HashJoin
> -> HashJoin
> -> SeqScan on tbl_child_0
> -> Hash ... Filter: hash_func(X) % 4 = 0
> -> SeqScan on other_table
> -> Hash ... Filter: hash_func(X) % 4 = 0
> -> SeqScan on another_table
> -> HashJoin
> -> HashJoin
> -> SeqScan on tbl_child_1
> -> Hash ... Filter: hash_func(X) % 4 = 1
> -> SeqScan on other_table
> -> Hash ... Filter: hash_func(X) % 4 = 1
> -> SeqScan on another_table
> -> HashJoin
> -> HashJoin
> -> SeqScan on tbl_child_2
> -> Hash ... Filter: hash_func(X) % 4 = 2
> -> SeqScan on other_table
> -> Hash ... Filter: hash_func(X) % 4 = 2
> -> SeqScan on another_table
> -> HashJoin
> -> HashJoin
> -> SeqScan on tbl_child_3
> -> Hash ... Filter: hash_func(X) % 4 = 3
> -> SeqScan on other_table
> -> Hash ... Filter: hash_func(X) % 4 = 3
> -> SeqScan on another_table
>
> In this case, underlying nodes are not always SeqScan. So, only Hash-node can
> have filter clauses.
>
>
> * Way to handle expression nodes
>
> All this patch supported is CHECK() constraint with equal operation on INT4 data
> type. You can learn various useful infrastructure of PostgreSQL. For example, ...
> - expression_tree_mutator() is useful to make a copy of expression
> node with small modification
> - pull_varnos() is useful to check which relations are referenced
> by the expression node.
> - RestrictInfo->can_join is useful to check whether the clause is
> binary operator, or not.
>
>
> Anyway, reuse of existing infrastructure is the best way to make a reliable
> infrastructure and to keep the implementation simple.
>
> Thanks,
> --
> NEC Business Creation Division / PG-Strom Project KaiGai Kohei
> <kaigai(at)ak(dot)jp(dot)nec(dot)com>
>
>
> > -----Original Message-----
> > From: pgsql-hackers-owner(at)postgresql(dot)org
> > [mailto:pgsql-hackers-owner(at)postgresql(dot)org] On Behalf Of Taiki Kondo
> > Sent: Thursday, August 13, 2015 6:30 PM
> > To: pgsql-hackers(at)postgresql(dot)org
> > Cc: Kaigai Kouhei(海外 浩平); Iwaasa Akio(岩浅 晃郎)
> > Subject: [HACKERS] [Proposal] Table partition + join pushdown
> >
> > Hi all,
> >
> > I saw the email about the idea from KaiGai-san[1], and I worked to
> > implement this idea.
> >
> > Now, I have implemented a part of this idea, so I want to propose this
> > feature.
> >
> > Patch attached just shows my concept of this feature.
> > It works fine for EXPLAIN, but it returns wrong result for other operations,
> sadly.
> >
> >
> >
> > Table partition + join pushdown
> > ===============================
> >
> > Motivation
> > ----------
> > To make join logic working more effectively, it is important to make
> > the size of relations smaller.
> >
> > Especially in Hash-join, it is meaningful to make the inner relation
> > smaller, because smaller inner relation can be stored within smaller hash table.
> > This means that memory usage can be reduced when joining with big tables.
> >
> >
> > Design
> > ------
> > It was mentioned by the email from KaiGai-san.
> > So I quote below here...
> >
> > ---- begin quotation ---
> > Let's assume a table which is partitioned to four portions, and
> > individual child relations have constraint by hash-value of its ID
> > field.
> >
> > tbl_parent
> > + tbl_child_0 ... CHECK(hash_func(id) % 4 = 0)
> > + tbl_child_1 ... CHECK(hash_func(id) % 4 = 1)
> > + tbl_child_2 ... CHECK(hash_func(id) % 4 = 2)
> > + tbl_child_3 ... CHECK(hash_func(id) % 4 = 3)
> >
> > If someone tried to join another relation with tbl_parent using
> > equivalence condition, like X = tbl_parent.ID, we know inner tuples
> > that does not satisfies the condition
> > hash_func(X) % 4 = 0
> > shall be never joined to the tuples in tbl_child_0.
> > So, we can omit to load these tuples to inner hash table preliminary,
> > then it potentially allows to split the inner hash-table.
> >
> > Current typical plan structure is below:
> >
> > HashJoin
> > -> Append
> > -> SeqScan on tbl_child_0
> > -> SeqScan on tbl_child_1
> > -> SeqScan on tbl_child_2
> > -> SeqScan on tbl_child_3
> > -> Hash
> > -> SeqScan on other_table
> >
> > It may be rewritable to:
> >
> > Append
> > -> HashJoin
> > -> SeqScan on tbl_child_0
> > -> Hash ... Filter: hash_func(X) % 4 = 0
> > -> SeqScan on other_table
> > -> HashJoin
> > -> SeqScan on tbl_child_1
> > -> Hash ... Filter: hash_func(X) % 4 = 1
> > -> SeqScan on other_table
> > -> HashJoin
> > -> SeqScan on tbl_child_2
> > -> Hash ... Filter: hash_func(X) % 4 = 2
> > -> SeqScan on other_table
> > -> HashJoin
> > -> SeqScan on tbl_child_3
> > -> Hash ... Filter: hash_func(X) % 4 = 3
> > -> SeqScan on other_table
> > ---- end quotation ---
> >
> > In the quotation above, it was written that filter is set at Hash node.
> > But I implemented that filter is set at SeqScan node under Hash node.
> > In my opinion, filtering tuples is work of Scanner.
> >
> > Append
> > -> HashJoin
> > -> SeqScan on tbl_child_0
> > -> Hash
> > -> SeqScan on other_table ... Filter: hash_func(X) % 4 = 0
> > -> HashJoin
> > -> SeqScan on tbl_child_1
> > -> Hash
> > -> SeqScan on other_table ... Filter: hash_func(X) % 4 = 1
> > -> HashJoin
> > -> SeqScan on tbl_child_2
> > -> Hash
> > -> SeqScan on other_table ... Filter: hash_func(X) % 4 = 2
> > -> HashJoin
> > -> SeqScan on tbl_child_3
> > -> Hash
> > -> SeqScan on other_table ... Filter: hash_func(X) % 4 = 3
> >
> >
> > API
> > ---
> > There are 3 new internal (static) functions to implement this feature.
> > try_hashjoin_pushdown(), which is main function of this feature, is
> > called from try_hashjoin_path(), and tries to push down HashPath under
> > the AppendPath.
> >
> > To do so, this function does following operations.
> >
> > 1. Check if this Hash-join can be pushed down under AppendPath
> > 2. To avoid an influence on other Path making operation,
> > copy inner path's RelOptInfo and make new SeqScan path from it.
> > At here, get CHECK() constraints from OUTER path, and convert its
> > Var node according to join condition. And also convert Var nodes
> > in join condition itself.
> > 3. Create new HashPath nodes between each sub-paths of AppendPath and
> > inner path made above.
> > 4. When operations from 1 to 3 are done for each sub-paths,
> > create new AppendPath which sub-paths are HashPath nodes made above.
> >
> > get_replaced_clause_constr() is called from try_hashjoin_pushdown(),
> > and get_var_nodes_recurse() is called from get_replaced_cluase_constr().
> > These 2 functions help above operations.
> > (I may revise this part to use expression_tree_walker() and
> > expression_tree_mutator().)
> >
> > In patch attached, it has the following limitations.
> > o It only works for hash-join operation.
> > (I want to support not only hash-join but also other logic.)
> > o Join conditions must be "=" operator with int4 variables.
> > o Inner path must be SeqScan.
> > (I want to support other path-node.)
> > o And now, planner may not choose this plan,
> > because estimated costs are usually larger than original (non-pushdown)
> plan.
> >
> > And also 1 internal (static) function, get_relation_constraints()
> > defined in plancat.c, is changed to global. This function will be
> > called from
> > get_replaced_clause_constr() to get CHECK() constraints.
> >
> >
> > Usage
> > -----
> > To use this feature, create partition tables and small table to join,
> > and run select operation with joining these tables.
> >
> > For your convenience, I attach DDL and DML script.
> > And I also attach the result of EXPLAIN.
> >
> >
> > Any comments are welcome. But, at first, I need your advices to
> > correct this patch's behavior.
> >
> > At least, I think it has to expand array of RangeTblEntry and other
> > arrays defined in PlannerInfo to register new RelOptInfos for new Path nodes
> mentioned above.
> > Or, is it better choice to modify query parser to implement this
> > feature more further?
> >
> >
> > Remarks :
> > [1]
> > http://www.postgresql.org/message-id/9A28C8860F777E439AA12E8AEA7694F80
> > 10F672
> > B(at)BPXM15GP(dot)gisp(dot)nec(dot)co(dot)jp
> >
> >
> > Best regards,
> > --
> > Taiki Kondo
> >
> > NEC Solution Innovators, Ltd.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jim Nasby 2015-09-29 02:48:00 Re: Rework the way multixact truncations work
Previous Message Stephen Frost 2015-09-29 02:18:43 Re: No Issue Tracker - Say it Ain't So!