Re: Partition-wise join for join between (declaratively) partitioned tables

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Partition-wise join for join between (declaratively) partitioned tables
Date: 2017-10-04 15:34:06
Message-ID: CA+Tgmob+Uw==ybejCpm7dkN9FE_81vdAYNirCEq-Fn+0DKoYJA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Oct 3, 2017 at 3:27 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I decided to skip over 0001 for today and spend some time looking at
> 0002-0006.

Back to 0001.

+ Enables or disables the query planner's use of partition-wise join
+ plans. When enabled, it spends time in creating paths for joins between
+ partitions and consumes memory to construct expression nodes to be used
+ for those joins, even if partition-wise join does not result in the
+ cheapest path. The time and memory increase exponentially with the
+ number of partitioned tables being joined and they increase linearly
+ with the number of partitions. The default is <literal>off</>.

I think this is too scary and too much technical detail. I think you
could just say something like: Enables or disables use of
partition-wise join, which allows a join between partitioned tables to
be performed by joining the matching partitions. Partition-wise join
currently applies only when the join conditions include all the
columns of the partition keys, which must be of the same data type and
have exactly matching sets of child partitions. Because
partition-wise join planning can use significantly increase CPU time
and memory usage during planning, the default is <literal>off</>.

+partitioned table. The join partners can not be found in other partitions. This
+condition allows the join between partitioned tables to be broken into joins
+between the matching partitions. The resultant join is partitioned in the same

"The join partners can not be found in other partitions." is redundant
with the previous sentence. I suggest deleting it. I also suggest
"This condition allows the join between partitioned tables to be
broken" -> "Because of this, the join between partitioned tables can
be broken".

+relation" for both partitioned table as well as join between partitioned tables
+which can use partition-wise join technique.

for either a partitioned table or a join between compatibly partitioned tables

+Partitioning properties of a partitioned relation are stored in
+PartitionSchemeData structure. Planner maintains a list of canonical partition
+schemes (distinct PartitionSchemeData objects) so that any two partitioned
+relations with same partitioning scheme share the same PartitionSchemeData
+object. This reduces memory consumed by PartitionSchemeData objects and makes
+it easy to compare the partition schemes of joining relations.

Not all of the partitioning properties are stored in the
PartitionSchemeData structure any more. I think this needs some
rethinking and maybe some expansion. As written, each of the first
two sentences needs a "the" at the beginning.

+ /*
+ * Create "append" paths for
partitioned joins. Do this before
+ * creating GatherPaths so that
partial "append" paths in
+ * partitioned joins will be considered.
+ */

I think you could shorten this to a single-line comment and just keep
the first sentence. Similarly in the other location where you have
the same sort of thing.

+ * child-joins. Otherwise, add_path might delete a path that some "append"
+ * path has reference to.

to which some path generated here has a reference.

Here and elsewhere, you use "append" rather than Append to refer to
the paths added. I suppose that's weasel-wording to work around the
fact that they might be either Append or MergeAppend paths, but I'm
not sure it's really going to convey that to anyone. I suggest
rephrasing those comments more generically, e.g.:

+ /* Add "append" paths containing paths from child-joins. */

You could say: Build additional paths for this rel from child-join paths.

Or something.

+ if (!REL_HAS_ALL_PART_PROPS(rel))
+ return;

Isn't this an unnecessarily expensive test? I mean, it shouldn't be
possible for it to have some arbitrary subset.

+ /*
+ * Every pair of joining relations we see here should have an equi-join
+ * between partition keys if this join has been deemed as a partitioned
+ * join. See build_joinrel_partition_info() for reasons.
+ */
+ Assert(have_partkey_equi_join(rel1, rel2, parent_sjinfo->jointype,
+
parent_restrictlist));

I suggest removing this assertion. Seems like overkill to me.

+ child_sjinfo = build_child_join_sjinfo(root, parent_sjinfo,
+
child_rel1->relids,
+
child_rel2->relids);

It seems like we might end up doing this multiple times for the same
child join, if there are more than 2 tables involved. Not sure if
there's a good way to avoid that. Similarly for child_restrictlist.

+ pk_has_clause = (bool *) palloc0(sizeof(bool) * num_pks);

Just do bool pk_has_clause[PARTITION_MAX_KEYS] instead. Stack
allocation is a lot faster, and then you don't need to pfree it.

+ /* Remove the relabel decoration. */

the -> any, decoration -> decorations

+ /*
+ * Replace the Var nodes of parent with those of children in
expressions.
+ * This function may be called within a temporary context, but the
+ * expressions will be shallow-copied into the plan. Hence copy those in
+ * the planner's context.
+ */

I can't make heads or tails of this comment.

--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -23,7 +23,9 @@
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
#include "optimizer/planmain.h"
+#include "optimizer/prep.h"
#include "optimizer/restrictinfo.h"
+#include "optimizer/tlist.h"
#include "optimizer/var.h"
#include "parser/parsetree.h"
#include "utils/lsyscache.h"

Maybe not needed? This is the only hunk in this file? Or should this
be part of one of the later patches?

+ Assert(IS_JOIN_REL(childrel) && IS_JOIN_REL(parentrel));
+
+ /* Ensure child relation is really what it claims to be. */
+ Assert(IS_OTHER_REL(childrel));

I suggest tightening this up a bit by removing the comment and the
blank line that precedes it.

+ foreach(lc, parentrel->reltarget->exprs)
+ {
+ PlaceHolderVar *phv = lfirst(lc);
+
+ if (IsA(phv, PlaceHolderVar))
+ {
+ /*
+ * In case the placeholder Var refers to any
of the parent
+ * relations, translate it to refer to the
corresponding child.
+ */
+ if (bms_overlap(phv->phrels, parentrel->relids) &&
+ childrel->reloptkind == RELOPT_OTHER_JOINREL)
+ {
+ phv = (PlaceHolderVar *)
adjust_appendrel_attrs(root,
+
(Node *) phv,
+
nappinfos,
+
appinfos);
+ }
+
+ childrel->reltarget->exprs =
lappend(childrel->reltarget->exprs,
+
phv);
+ phv_added = true;
+ }
+ }

What if the PHV is buried down inside the expression someplace rather
than being at the top level? More generally, why are we not just
applying adjust_appendrel_attrs() to the whole expression?

+ /* Adjust the cost and width of child targetlist. */
+ if (phv_added)
+ {
+ childrel->reltarget->cost.startup =
parentrel->reltarget->cost.startup;
+ childrel->reltarget->cost.per_tuple =
parentrel->reltarget->cost.per_tuple;
+ childrel->reltarget->width = parentrel->reltarget->width;
+ }

Making this conditional on phv_added is probably not saving anything.
Branches are expensive.

/*
* Otherwise, anything in a baserel or joinrel
targetlist ought to be
- * a Var. (More general cases can only appear in
appendrel child
- * rels, which will never be seen here.)
+ * a Var or ConvertRowtypeExpr. For either of those,
find the original
+ * baserel where they originate.
*/

Hmm, but now we could potentially see an appendrel child rel here, so
don't we need to worry about more general cases? If not, let's
explain why not.

+ * if, it's a ConvertRowtypeExpr, it will be
computed only for the

American usage does not put a comma after if like this (unless you are
writing writing if, for example, blah blah blah -- but there the
commas are to surround for example, not due to the if itself).

+/*
+ * build_joinrel_partition_info
+ * If the join between given partitioned relations is
possibly partitioned
+ * set the partitioning scheme and partition keys
expressions for the
+ * join.
+ *
+ * If the two relations have same partitioning scheme, their join may be
+ * partitioned and will follow the same partitioning scheme as the joining
+ * relations.
+ */

I think you could drop the primary comment block and use the secondary
block as the primary one. That is, get rid of "If the join
between..." and promote "If the two relations...".

+ * The join is not partitioned, if any of the relations being joined are

Another comma that's not typical of American usage.

+ * For an N-way inner join, where every syntactic inner join
has equi-join

has -> has an

+ * For an N-way join with outer joins, where every syntactic join has an
+ * equi-join between partition keys and a matching partitioning scheme,
+ * outer join reordering identities in optimizer/README imply that only
+ * those pairs of join are legal which have an equi-join
between partition
+ * keys. Thus every pair of joining relations we see for this
join should
+ * have an equi-join between partition keys if this join has been deemed
+ * as a partitioned join.

In line 2, partition keys -> the partition keys
In line 3, outer join -> the outer join

"pairs of join" sounds wrong too, although I'm not sure how to reword it.

More broadly: I don't think I understand this comment. The statement
about "those pairs of join are legal which have an equi-join between
partition keys" doesn't match my understanding e.g. A IJ B ON A.x =
B.x LJ C ON A.x = C.x surely allows a B-C join, but there's no such
clause syntatically.

Maybe you could replace this whole comment block with something like
this: We can only consider this join as an input to further
partition-wise joins if (a) the input relations are partitioned, (b)
the partition schemes match, and (c) we can identify an equi-join
between the partition keys. Note that if it were possible for
have_partkey_equi_join to return different answers for the same
joinrel depending on which join ordering we try first, this logic
would break. That shouldn't happen, though, because of the way the
query planner deduces implied equalities.

+ * Join relation is partitioned using same partitioning scheme as the
+ * joining relations and has same bounds.

the same partitioning scheme

+ * An INNER join between two partitioned relations is partitioned by key
+ * expressions from both the relations. For tables A and B
partitioned by
+ * a and b respectively, (A INNER JOIN B ON A.a = B.b) is partitioned by
+ * both A.a and B.b.
+ *
+ * A SEMI/ANTI join only retains data from the outer side and is
+ * partitioned by the partition keys of the outer side.

I would write: An INNER join between two partitioned relations can be
regarded as partitioned by either key expression. For example, A
INNER JOIN B ON A.a = B.b can be regarded as partitioned on A.a or on
B.b; they are equivalent. For a SEMI or ANTI join, the result can
only be regarded as being partitioned in the same manner as the outer
side, since the inner columns are not retained.

+ * An OUTER join like (A LEFT JOIN B ON A.a = B.b) may produce rows with
+ * B.b NULL. These rows may not fit the partitioning
conditions imposed on
+ * B.b. Hence, strictly speaking, the join is not partitioned by B.b.

Good.

+ * Strictly speaking, partition keys of an OUTER join should include
+ * partition key expressions from the OUTER side only. Consider a join

I would join this with the previous sentence instead of repeating
strictly speaking: ...and thus the partition keys should include
partition key expressions from the OUTER side only. After that
sentence, I'd skip a lot of the intermediate words here and continue
this way: However, because all commonly-used comparison operators are
strict, the presence of nulls on the outer side doesn't cause any
problem; they can't match anything at future join levels anyway.
Therefore, we track two sets of expressions: those that authentically
partition the relation (partexprs) and those that partition the
relation with the exception that extra nulls may be present
(nullable_partexprs). When the comparison operator is strict, the
latter is just as good as the former.

Then, I think you can omit the rest of what you have; it should be
clear enough what's going on for the full and right cases given that
explanation.

+ * being joined. partexprs and nullable_partexprs are arrays
containing part_scheme->partnatts

Long line, needs reflowing.

I don't think this is too far from being committable. You've done
some nice work here!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2017-10-04 15:40:32 Re: [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple
Previous Message Robert Haas 2017-10-04 15:31:21 Re: Partition-wise join for join between (declaratively) partitioned tables