Re: Partition-wise join for join between (declaratively) partitioned tables

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc: Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Partition-wise join for join between (declaratively) partitioned tables
Date: 2016-09-27 20:32:51
Message-ID: CA+TgmoanxG=3dP--_YA8Hy2KM7eLGt=6Ob2oLxsmVvvmH7qhbQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Sep 22, 2016 at 6:41 AM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> [ new patch ]

This should probably get updated since Rajkumar reported a crash.
Meanwhile, here are some comments from an initial read-through:

+ * Multiple relations may be partitioned in the same way. The relations
+ * resulting from joining such relations may be partitioned in the same way as
+ * the joining relations. Similarly, relations derived from such relations by
+ * grouping, sorting be partitioned in the same as the underlying relations.

I think you should change "may be partitioned in the same way" to "are
partitioned in the same way" or "can be regarded as partitioned in the
same way". The sentence that begins with "Similarly," is not
grammatical; it should say something like: ...by grouping or sorting
are partitioned in the same way as the underlying relations.

@@ -870,20 +902,21 @@ RelationBuildPartitionDesc(Relation rel)
result->bounds->rangeinfo = rangeinfo;
break;
}
}
}

MemoryContextSwitchTo(oldcxt);
rel->rd_partdesc = result;
}

+
/*
* Are two partition bound collections logically equal?
*
* Used in the keep logic of relcache.c (ie, in RelationClearRelation()).
* This is also useful when b1 and b2 are bound collections of two separate
* relations, respectively, because BoundCollection is a canonical
* representation of a set partition bounds (for given partitioning strategy).
*/
bool
partition_bounds_equal(PartitionKey key,

Spurious hunk.

+ * For an umpartitioned table, it returns NULL.

Spelling.

+ * two arguemnts and returns boolean. For types, it
suffices to match

Spelling.

+ * partition key expression is stored as a single member list to accomodate

Spelling.

+ * For a base relation, construct an array of partition key expressions. Each
+ * partition key expression is stored as a single member list to accomodate
+ * more partition keys when relations are joined.

How would joining relations result in more partitioning keys getting
added? Especially given the comment for the preceding function, which
says that a new PartitionScheme gets created unless an exact match is
found.

+ if (!lc)

Test lc == NIL instead of !lc.

+extern int
+PartitionSchemeGetNumParts(PartitionScheme part_scheme)
+{
+ return part_scheme ? part_scheme->nparts : 0;
+}

I'm not convinced it's a very good idea for this function to have
special handling for when part_scheme is NULL. In
try_partition_wise_join() that checks is not needed because it's
already been done, and in generate_partition_wise_join_paths it is
needed but only because you are initializing nparts too early. If you
move this initialization down below the IS_DUMMY_REL() check you won't
need the NULL guard. I would ditch this function and let the callers
access the structure member directly.

+extern int
+PartitionSchemeGetNumKeys(PartitionScheme part_scheme)
+{
+ return part_scheme ? part_scheme->partnatts : 0;
+}

Similarly here. have_partkey_equi_join should probably have a
quick-exit path when part_scheme is NULL, and then num_pks can be set
afterwards unconditionally. Same for match_expr_to_partition_keys.
build_joinrel_partition_info already has it and doesn't need this
double-check.

+extern Oid *
+PartitionDescGetPartOids(PartitionDesc part_desc)
+{
+ Oid *part_oids;
+ int cnt_parts;
+
+ if (!part_desc || part_desc->nparts <= 0)
+ return NULL;
+
+ part_oids = (Oid *) palloc(sizeof(Oid) * part_desc->nparts);
+ for (cnt_parts = 0; cnt_parts < part_desc->nparts; cnt_parts++)
+ part_oids[cnt_parts] = part_desc->oids[cnt_parts];
+
+ return part_oids;
+}

I may be missing something, but this looks like a bad idea in multiple
ways. First, you've got checks for part_desc's validity here that
should be in the caller, as noted above. Second, you're copying an
array by looping instead of using memcpy(). Third, the one and only
caller is set_append_rel_size, which doesn't seem to have any need to
copy this data in the first place. If there is any possibility that
the PartitionDesc is going to change under us while that function is
running, something is deeply broken. Nothing in the planner is going
to cope with the table structure changing under us, so it had better
not.

+ /*
+ * For a partitioned relation, we will save the child RelOptInfos in parent
+ * RelOptInfo in the same the order as corresponding bounds/lists are
+ * stored in the partition scheme.
+ */

This comment seems misplaced; shouldn't it be next to the code that is
actually doing this, rather than the code that is merely setting up
for it? And, also, the comment implies that we're doing this instead
of what we'd normally do, whereas I think we are actually doing
something additional.

+ /*
+ * Save topmost parent's relid. If the parent itself is a child of some
+ * other relation, use parent's topmost parent relids.
+ */
+ if (rel->top_parent_relids)
+ childrel->top_parent_relids = rel->top_parent_relids;
+ else
+ childrel->top_parent_relids = bms_copy(rel->relids);

Comment should explain why we're doing it, not what we're doing. The
comment as written just restates what anybody who's likely to be
looking at this can already see to be true from looking at the code
that follows. The question is why do it.

+ /* Set only for "other" base or join relations. */
+ Relids top_parent_relids;

Comment should say what it is, not just when it's set.

+ /* Should have found all the childrels of a partitioned relation. */
+ if (rel->part_scheme)
+ {
+ int cnt_parts;
+ for (cnt_parts = 0; cnt_parts < nparts; cnt_parts++)
+ Assert(rel->part_rels[cnt_parts]);
+ }

A block that does nothing but Assert() should be guarded by #ifdef
USE_ASSERT_CHECKING. Although, actually, maybe this should be an
elog(), just in case?

+ }
+
+ add_paths_to_append_rel(root, rel, live_childrels);
+}
+
+static void
+add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
+ List *live_childrels)

The new function should have a header comment, which should include an
explanation of why this is now separate from
set_append_rel_pathlist().

+ if (!live_childrels)

As before, I think live_childrels == NIL is better style.

+ generate_partition_wise_join_paths(root, rel);

Needs an update to the comment earlier in the hunk. It's important to
explain why this has to be done here and not within
join_search_one_level.

+ /* Recursively collect the paths from child joinrel. */
+ generate_partition_wise_join_paths(root, child_rel);

Given the recursion, check_stack_depth() at top of function is
probably appropriate. Same for try_partition_wise_join().

+ if (live_children)
+ pfree(live_children);

Given that none of the substructure, including ListCells, will be
freed, this seems utterly pointless. If it's necessary to recover
memory here at all, we probably need to be more aggressive about it.
Have you tested the effect of this patch on planner memory consumption
with multi-way joins between tables with many partitions? If you
haven't, you probably should. (Testing runtime would be good, too.)
Does it grow linearly? Quadratically? Exponentially? Minor leaks
don't matter, but if we're generating too much garbage we'll have to
make sure it gets cleaned up soon enough to prevent runaway memory
usage.

/*
+ * An inner path parameterized by the parent relation of outer
+ * relation needs to be reparameterized by the outer relation to be used
+ * for parameterized nested loop join.
+ */

No doubt, but I think the comment is missing the bigger picture -- it
doesn't say anything about this being here to support partition-wise
joins, which seems like a key point.

+ /* If we could not translate the path, don't produce nest loop path. */
+ if (!inner_path)
+ return;

Why would that ever happen?

+/*
+ * If the join between the given two relations can be executed as
+ * partition-wise join create the join relations for partition-wise join,
+ * create paths for those and then create append paths to combine
+ * partition-wise join results.
+ */
+static void
+try_partition_wise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
+ RelOptInfo *joinrel, SpecialJoinInfo *parent_sjinfo,
+ List *parent_restrictlist)

This comment doesn't accurately describe what the function does. No
append paths are created here; that happens at a much later stage. I
think this comment needs quite a bit more work, and maybe the function
should be renamed, too. There are really two steps involved here:
first, we create paths for each child, attached to a new RelOptInfo
flagged as RELOPT_OTHER_JOINREL paths; later, we create additional
paths for the parent RelOptInfo by appending a path for each child.

Broadly, I think there's a lack of adequate documentation of the
overall theory of operation of this patch. I believe that an update
to the optimizer README would be appropriate, probably with a new
section but maybe incorporating the new material into an existing
section. In addition, the comments for individual comments and chunks
of code need to do a better job explaining how each part of the patch
contributes to the overall picture. I also think we need to do a
better join hammering out the terminology. I don't particularly like
the term "partition-wise join" in the first place, although I don't
know what would be better, but we certainly need to avoid confusing a
partition-wise join -- which is a join performed by joining each
partition of one partitioned rel to the corresponding partition of a
similarly partitioned rel rather than by the usual execution strategy
of joining the parent rels -- with the concept of an other-join-rel,
which an other-member-rel analogue for joins. I don't think the patch
is currently very clear about this right now, either in the code or in
the comments. Maybe this function ought to be named something like
make_child_joins() or make_child_join_paths(), and we could use "child
joins" and/or "child join paths" as standard terminology throughout
the patch.

+ rel1_desc = makeStringInfo();
+ rel2_desc = makeStringInfo();
+
+ /* TODO: remove this notice when finalising the patch. */
+ outBitmapset(rel1_desc, rel1->relids);
+ outBitmapset(rel2_desc, rel2->relids);
+ elog(NOTICE, "join between relations %s and %s is considered for
partition-wise join.",
+ rel1_desc->data, rel2_desc->data);

Please remove your debugging cruft before submitting patches to
pgsql-hackers, or at least put #ifdef NOT_USED or something around it.

+ * We allocate the array for child RelOptInfos till we find at least one
+ * join order which can use partition-wise join technique. If no join order
+ * can use partition-wise join technique, there are no child relations.

This comment has problems. I think "till" is supposed to be "until",
and there's supposed to be a "don't" in there somewhere. But really,
I think what you're going for is just /* Allocate when first needed */
which would be a lot shorter and also more clear.

+ * Create join relations for the partition relations, if they do not exist
+ * already. Add paths to those for the given pair of joining relations.

I think the comment could be a bit more explanatory here. Something
like: "This joinrel is partitioned, so iterate over the partitions and
create paths for each one, allowing us to eventually build an
append-of-joins path for the parent. Since this routine may be called
multiple times for various join orders, the RelOptInfo needed for each
child join may or may not already exist, but the paths for this join
order definitely do not. Note that we don't create any actual
AppendPath at this stage; it only makes sense to do that at the end,
after each possible join order has been considered for each child
join. The best join order may differ from child to child."

+ * partiticipating in the given partition relations. We need them

Spelling.

+/*
+ * Construct the SpecialJoinInfo for the partition-wise join using parents'
+ * special join info. Also, instead of
+ * constructing an sjinfo everytime, we should probably save it in
+ * root->join_info_list and search within it like join_is_legal?
+ */

The lines here are of very different lengths for no particularly good
reason, and it should end with a period, not a question mark.

On the substance of the issue, it seems like the way you're doing this
right now could allocate a very large number of SpecialJoinInfo
structures. For every join relation, you'll create one
SpecialJoinInfo per legal join order per partition. That seems like
it could get to be a big number. I don't know if that's going to be a
problem from a memory-usage standpoint, but it seems like it might.
It's not just the SpecialJoinInfo itself; all of the substructure gets
duplicated, too.

+ SpecialJoinInfo *sjinfo = copyObject(parent_sjinfo);
+ sjinfo->min_lefthand = adjust_partition_relids(sjinfo->min_lefthand,
+ append_rel_infos1);

Missing a blank line here.

+ AppendRelInfo *ari = lfirst(lc);

Standard naming convention for an AppendRelInfo variable seems to be
appinfo, not ari. (I just did "git grep AppendRelInfo".)

+ /* Skip non-equi-join clauses. */
+ if (!rinfo->can_join ||
+ rinfo->hashjoinoperator == InvalidOid ||
+ !rinfo->mergeopfamilies)
+ continue;

There's definitely something ugly about this. If rinfo->can_join is
false, then we're done. But suppose one of mergeopfamilies == NIL and
rinfo->hashoperator == InvalidOid is true and the other is false. Are
we really precluded from doing a partiion-wise join in that case, or
are we just prohibited from using certain join strategies? In most
places where we make similar tests, we're careful not to require more
than we need.

I also think that these tests need to consider the partitioning
operator in use. Suppose that the partition key is of a type T which
has two operator classes X and Y. Both relations are partitioned
using an operator from opfamily X, but the join condition mentions
opfamily Y. I'm pretty sure this precludes a partitionwise join. If
the join condition used opfamily X, then we would have a guarantee
that two rows which compared as equal would be in the same partition,
but because it uses opfamily Y, that's not guaranteed. For example,
if T is a text type, X might test for exact equality using "C"
collation rules, while Y might test for equality using some
case-insensitive set of rules. If the partition boundaries are such
that "foo" and "FOO" are in different partitions, a partitionwise join
using the case-insensitive operator will produce wrong results. You
can also imagine this happening with numeric, if you have one opclass
(like the default one) that considers 5.0 and 5.00 to be equal, but
another opclass that thinks they are different; if the latter is used
to set the partition bounds, 5.0 and 5.00 could end up in different
partitions - which will be fine if an operator from that opclass is
used for the join, but not if an operator from the regular opclass is
used.

After thinking this over a bit, I think the right way to think about this is:

1. Amit's patch currently only ever uses btree opfamilies for
partitioning. It uses those for both range partitioning and list
partitioning. If we ever support hash partitioning, we would
presumably use hash opfamilies for that purpose, but right now it's
all about btree opfamilies.

2. Therefore, if A and B are partitioned but the btree opfamilies
don't match, they don't have the same partitioning scheme and this
code should never be reached. Similarly, if they use the same
opfamily but different collations, the partitioning schemes shouldn't
match and therefore this code should not be reached.

3. If A and B are partitioned and the partitioning opfamilies - which
are necessarily btree opfamilies - do match, then the operator which
appears in the query needs to be from the same opfamily and have
amopstrategy of BTEqualStrategyNumber within that opfamily. If not,
then a partition-wise join is not possible.

4. Assuming the above conditions are met, have_partkey_equi_join
doesn't need to care whether the operator chosen has mergeopfamilies
or a valid hashjoinoperator. Those factors will control which join
methods are legal, but not whether a partitionwise join is possible in
principle.

Let me know whether that seems right.

+ * RelabelType node; eval_const_expressions() will have simplied if more

Spelling.

/*
+ * Code below scores equivalence classes by how many equivalence members
+ * can produce join clauses for this join relation. Equivalence members
+ * which do not cover the parents of a partition-wise join relation, can
+ * produce join clauses for partition-wise join relation.
+ */

I don't know what that means. The comma in the second sentence
doesn't belong there.

+ /*
+ * TODO: Instead of copying and mutating the trees one child relation at a
+ * time, we should be able to do this en-masse for all the partitions
+ * involved.
+ */

I don't see how that would be possible, but if it's a TODO, you'd
better do it (or decide not to do it and remove or change the
comment).

/*
* Create explicit sort nodes for the outer and inner paths if necessary.
*/
if (best_path->outersortkeys)
{
+ Relids outer_relids = outer_path->parent->relids;
Sort *sort = make_sort_from_pathkeys(outer_plan,
- best_path->outersortkeys);
+ best_path->outersortkeys,
+ outer_relids);

The changes related to make_sort_from_pathkeys() are pretty opaque to
me. Can you explain?

+ * Change parameterization of sub paths recursively. Also carry out any

"sub paths" should not be two words, here or anywhere.

+reparameterize_path_for_child(PlannerInfo *root, Path *path,
+ RelOptInfo *child_rel)

This is suspiciously unlike reparameterize_path. Why?

+ /* Computer information relevant to the foreign relations. */
+ set_foreign_rel_properties(joinrel, outer_rel, inner_rel);

Perhaps this refactoring could be split out into a preliminary patch,
which would then simplify this patch. And same for add_join_rel().

+ * Produce partition-wise joinrel's targetlist by translating the parent
+ * joinrel's targetlist. This will also include the required placeholder

Again the confusion between a "child" join and a partition-wise join...

+ /*
+ * Nothing to do if
+ * a. partition-wise join is disabled.
+ * b. joining relations are not partitioned.
+ * c. partitioning schemes do not match.
+ */
+

I don't think that's going to survive pgindent.

+ * are not considered equal, an equi-join involing inner partition keys

Spelling.

+ * Collect the partition key expressions. An OUTER join will produce rows
+ * where the partition key columns of inner side are NULL and may not fit
+ * the partitioning scheme with inner partition keys. Since two NULL values
+ * are not considered equal, an equi-join involing inner partition keys
+ * still prohibits cross-partition joins while joining with another
+ * similarly partitioned relation.

I can't figure out what this comment is trying to tell me. Possibly I
just need more caffeine.

+ * Adding these two join_rel_level list also means that top level list has more
+ * than one join relation, which is symantically incorrect.

I don't understand this, either; also, spelling.

As a general comment, the ratio of tests-to-code in this patch is way
out of line with PostgreSQL's normal practices. The total patch file
is 10965 lines. The test cases begin at line 3047, meaning that in
round figures you've got about one-quarter code and about
three-quarters test cases. I suspect that a large fraction of those
test cases aren't adding any meaningful code coverage and will just
take work to maintain. That needs to be slimmed down substantially in
any version of this considered for commit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Gustafsson 2016-09-27 20:41:10 Re: Make flex/bison checks stricter in Git trees
Previous Message Jesper Pedersen 2016-09-27 19:08:30 Re: Write Ahead Logging for Hash Indexes