Re: Partition-wise join for join between (declaratively) partitioned tables

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Partition-wise join for join between (declaratively) partitioned tables
Date: 2016-10-14 04:37:05
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

On Wed, Sep 28, 2016 at 2:02 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Sep 22, 2016 at 6:41 AM, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> [ new patch ]
> This should probably get updated since Rajkumar reported a crash.
> Meanwhile, here are some comments from an initial read-through:

Done. Fixed those crashes. Also fixed some crashes in foreign table
code and postgres_fdw. The tests were provided by Rajkumar. I am
working on including those in my patch. The attached patch is still
based on Amit's patches set of patches posted on 15th Sept. 2016. He
is addressing your comments on his patches. So, I am expecting a more
stable version arrive soon. I will rebase my patches then. Because of
a bug in those patches related to multi-level partitioned tables and
lateral joins and also a restriction on sharing partition keys across
levels of partitions, the testcase is still failing. I will work on
that while rebasing the patch.

> + * Multiple relations may be partitioned in the same way. The relations
> + * resulting from joining such relations may be partitioned in the same way as
> + * the joining relations. Similarly, relations derived from such relations by
> + * grouping, sorting be partitioned in the same as the underlying relations.
> I think you should change "may be partitioned in the same way" to "are
> partitioned in the same way" or "can be regarded as partitioned in the
> same way".

The relations resulting from joining partitioned relations are
partitioned in the same way, if there exist equi-join condition/s
between their partition keys. If such equi-joins do not exist, the
join is *not* partitioned. Hence I did not use "are" or "can be" which
indicate a certainty. Instead I used "may" which indicates
"uncertainty". I am not sure whether that's a good place to explain
the conditions under which such relations are partitioned. Those
conditions will change as we implement more and more partition-wise
join strategies. But that comment conveys two things 1. partition
scheme makes sense for all kinds of relations 2. multiple relations
(of any kind) may share partition scheme. I have slightly changed the
wording to make this point clear. Please let me know if it looks

> The sentence that begins with "Similarly," is not
> grammatical; it should say something like: grouping or sorting
> are partitioned in the same way as the underlying relations.

Done. Instead of "are" I have used "may" for the same reason as above.

> @@ -870,20 +902,21 @@ RelationBuildPartitionDesc(Relation rel)
> result->bounds->rangeinfo = rangeinfo;
> break;
> }
> }
> }
> MemoryContextSwitchTo(oldcxt);
> rel->rd_partdesc = result;
> }
> +
> /*
> * Are two partition bound collections logically equal?
> *
> * Used in the keep logic of relcache.c (ie, in RelationClearRelation()).
> * This is also useful when b1 and b2 are bound collections of two separate
> * relations, respectively, because BoundCollection is a canonical
> * representation of a set partition bounds (for given partitioning strategy).
> */
> bool
> partition_bounds_equal(PartitionKey key,
> Spurious hunk.

Thanks. Done.

> + * For an umpartitioned table, it returns NULL.
> Spelling.

Done. Thanks.

> + * two arguemnts and returns boolean. For types, it
> suffices to match
> Spelling.

Thanks. Done.

> + * partition key expression is stored as a single member list to accomodate
> Spelling.

Thanks. Done.

> + * For a base relation, construct an array of partition key expressions. Each
> + * partition key expression is stored as a single member list to accomodate
> + * more partition keys when relations are joined.
> How would joining relations result in more partitioning keys getting
> added? Especially given the comment for the preceding function, which
> says that a new PartitionScheme gets created unless an exact match is
> found.

Let's assume that relation A and B are partitioned by columns a and b
resp. and have same partitioning scheme. This means that the datatypes
of a and b as well as the opclass used for comparing partition key
values of A and B are same. A join between A and B with condition A.a
= B.b is partitioned by both A.a and B.b. We need to keep track of
both the keys in case AB joins with C which is partitioned in the same
manner. I guess, the confusion is with the term "partition keys" -
which is being used to indicate the class of partition key as well as
instance of partition key. In the above example, the datatype of
partition key and the opclass together indicate partition key class
whereas A.a and B.b are instances of that class. Increase in partition
keys may mean both increase in the number of classes or increase in
the number of instances. In the above comment I used to mean number of
instances. May be we should use "partition key expressions" to
indicate the partition key instances and "partition key" to indicate
partition key class. I have changed the comments to use partition keys
and partition key expressions appropriately. Please let me know if the
comments are worded correctly.

PartitionScheme does not hold the actual partition key expressions. It
holds the partition key type and opclass used for comparison, which
should be same for all the relations sharing the partition scheme.

> + if (!lc)
> Test lc == NIL instead of !lc.

NIL is defined as (List *) NULL and lc is ListCell *. So changed the
test to lc == NULL instead of !lc.

> +extern int
> +PartitionSchemeGetNumParts(PartitionScheme part_scheme)
> +{
> + return part_scheme ? part_scheme->nparts : 0;
> +}
> I'm not convinced it's a very good idea for this function to have
> special handling for when part_scheme is NULL. In
> try_partition_wise_join() that checks is not needed because it's
> already been done, and in generate_partition_wise_join_paths it is
> needed but only because you are initializing nparts too early. If you
> move this initialization down below the IS_DUMMY_REL() check you won't
> need the NULL guard. I would ditch this function and let the callers
> access the structure member directly.
> +extern int
> +PartitionSchemeGetNumKeys(PartitionScheme part_scheme)
> +{
> + return part_scheme ? part_scheme->partnatts : 0;
> +}
> Similarly here. have_partkey_equi_join should probably have a
> quick-exit path when part_scheme is NULL, and then num_pks can be set
> afterwards unconditionally. Same for match_expr_to_partition_keys.
> build_joinrel_partition_info already has it and doesn't need this
> double-check.
> +extern Oid *
> +PartitionDescGetPartOids(PartitionDesc part_desc)
> +{
> + Oid *part_oids;
> + int cnt_parts;
> +
> + if (!part_desc || part_desc->nparts <= 0)
> + return NULL;
> +
> + part_oids = (Oid *) palloc(sizeof(Oid) * part_desc->nparts);
> + for (cnt_parts = 0; cnt_parts < part_desc->nparts; cnt_parts++)
> + part_oids[cnt_parts] = part_desc->oids[cnt_parts];
> +
> + return part_oids;
> +}
> I may be missing something, but this looks like a bad idea in multiple
> ways. First, you've got checks for part_desc's validity here that
> should be in the caller, as noted above. Second, you're copying an
> array by looping instead of using memcpy(). Third, the one and only
> caller is set_append_rel_size, which doesn't seem to have any need to
> copy this data in the first place. If there is any possibility that
> the PartitionDesc is going to change under us while that function is
> running, something is deeply broken. Nothing in the planner is going
> to cope with the table structure changing under us, so it had better
> not.

These three functions were written based on Amit Langote's patches
which did not expose partition related structures outside partition.c.
Hence they required wrappers. I have moved PartitionSchemeData to
partition.h and removed these functions. Instead the members are
accessed directly.

> + /*
> + * For a partitioned relation, we will save the child RelOptInfos in parent
> + * RelOptInfo in the same the order as corresponding bounds/lists are
> + * stored in the partition scheme.
> + */
> This comment seems misplaced; shouldn't it be next to the code that is
> actually doing this, rather than the code that is merely setting up
> for it? And, also, the comment implies that we're doing this instead
> of what we'd normally do, whereas I think we are actually doing
> something additional.

Ok. I have moved the comment few line below, near the code which saves
the partition RelOptInfos.

> + /*
> + * Save topmost parent's relid. If the parent itself is a child of some
> + * other relation, use parent's topmost parent relids.
> + */
> + if (rel->top_parent_relids)
> + childrel->top_parent_relids = rel->top_parent_relids;
> + else
> + childrel->top_parent_relids = bms_copy(rel->relids);
> Comment should explain why we're doing it, not what we're doing. The
> comment as written just restates what anybody who's likely to be
> looking at this can already see to be true from looking at the code
> that follows. The question is why do it.

The point of that comment is to explain how it percolates down the
hierarchy, which is not so clear from the code. I have changed it to
* Recursively save topmost parent's relid in RelOptInfos of
* partitions.

Or you are expecting that the comment to explain the purpose of
top_parent_relids? I don't think that's a good idea, since the purpose
will change over the time and the comment will soon be out of sync
with the actual code, unless the developers expanding the usage
remember to update the comment. I have not seen the comments,
explaining purpose, next to the assignments. Take for example

> + /* Set only for "other" base or join relations. */
> + Relids top_parent_relids;
> Comment should say what it is, not just when it's set.

Done. Check if it looks good.

> + /* Should have found all the childrels of a partitioned relation. */
> + if (rel->part_scheme)
> + {
> + int cnt_parts;
> + for (cnt_parts = 0; cnt_parts < nparts; cnt_parts++)
> + Assert(rel->part_rels[cnt_parts]);
> + }
> A block that does nothing but Assert() should be guarded by #ifdef
> USE_ASSERT_CHECKING. Although, actually, maybe this should be an
> elog(), just in case?

Changed it to elog().

> + }
> +
> + add_paths_to_append_rel(root, rel, live_childrels);
> +}
> +
> +static void
> +add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
> + List *live_childrels)
> The new function should have a header comment, which should include an
> explanation of why this is now separate from
> set_append_rel_pathlist().

Sorry for missing it. Added the prologue. Let me know, if it looks
good. I have made sure that all functions have a prologue and tried to
match the style with surrounding functions. Let me know if I have
still missed any or the styles do not match.

> + if (!live_childrels)
> As before, I think live_childrels == NIL is better style.


> + generate_partition_wise_join_paths(root, rel);
> Needs an update to the comment earlier in the hunk. It's important to
> explain why this has to be done here and not within
> join_search_one_level.

Thanks for pointing that out. Similar to generate_gather_paths(), we
need to add explanation in standard_join_search() as well as in the
function prologue. Did that. Let me know if it looks good.

> + /* Recursively collect the paths from child joinrel. */
> + generate_partition_wise_join_paths(root, child_rel);
> Given the recursion, check_stack_depth() at top of function is
> probably appropriate. Same for try_partition_wise_join().

Done. I wouldn't imagine a user creating that many levels of
partitions, but it's good to guard against some automated script that
has gone berserk.

> + if (live_children)
> + pfree(live_children);
> Given that none of the substructure, including ListCells, will be
> freed, this seems utterly pointless. If it's necessary to recover
> memory here at all, we probably need to be more aggressive about it.

I intended to use list_free() instead of pfree(). Fixed that.

> Have you tested the effect of this patch on planner memory consumption
> with multi-way joins between tables with many partitions? If you
> haven't, you probably should. (Testing runtime would be good, too.)
> Does it grow linearly? Quadratically? Exponentially? Minor leaks
> don't matter, but if we're generating too much garbage we'll have to
> make sure it gets cleaned up soon enough to prevent runaway memory
> usage.

I tried to check memory usage with various combinations of number of
partitions and number of relations being joined. For higher number of
relations being joined like 10 with 100 partitions, OOM killer kicked
in during the planning phase. I am suspecting
adjust_partitionrel_attrs() (changed that name to
adjust_join_appendrel_attrs() to be in sync with
adjust_appendrel_attrs()) to be the culprit. It copies expression
trees every time for joining two children. That's an exponentially
increasing number as the number of legal joins increases
exponentially. I am still investigating this.

As a side question, do we have a function to free an expression tree?
I didn't find any.

> /*
> + * An inner path parameterized by the parent relation of outer
> + * relation needs to be reparameterized by the outer relation to be used
> + * for parameterized nested loop join.
> + */
> No doubt, but I think the comment is missing the bigger picture -- it
> doesn't say anything about this being here to support partition-wise
> joins, which seems like a key point.

I have tried to explain the partition-wise join context. Let me know
if it looks good.

> + /* If we could not translate the path, don't produce nest loop path. */
> + if (!inner_path)
> + return;
> Why would that ever happen?

Right now, reparameterize_path_for_child() does not support all kinds
of paths. So I have added that condition. I will add support for more
path types there once we agree that this is the right way to translate
the paths and that the path translation is required.
> +/*
> + * If the join between the given two relations can be executed as
> + * partition-wise join create the join relations for partition-wise join,
> + * create paths for those and then create append paths to combine
> + * partition-wise join results.
> + */
> +static void
> +try_partition_wise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
> + RelOptInfo *joinrel, SpecialJoinInfo *parent_sjinfo,
> + List *parent_restrictlist)
> This comment doesn't accurately describe what the function does. No
> append paths are created here; that happens at a much later stage.

Removed reference to the append paths. Sorry for leaving it there,
when I moved the append path creation to a later stage.

> I
> think this comment needs quite a bit more work, and maybe the function
> should be renamed, too.

Improved the comments in the prologue and inside the function. Please
let me know, if they look good.

> There are really two steps involved here:
> first, we create paths for each child, attached to a new RelOptInfo
> flagged as RELOPT_OTHER_JOINREL paths; later, we create additional
> paths for the parent RelOptInfo by appending a path for each child.

Right, the first one is done in try_partition_wise_join() and the
later is done in generate_partition_wise_join_paths()

> Broadly, I think there's a lack of adequate documentation of the
> overall theory of operation of this patch. I believe that an update
> to the optimizer README would be appropriate, probably with a new
> section but maybe incorporating the new material into an existing
> section.

Done. I have added a separate section to optimizer/README

> In addition, the comments for individual comments and chunks
> of code need to do a better job explaining how each part of the patch
> contributes to the overall picture.

> I also think we need to do a
> better join hammering out the terminology. I don't particularly like
> the term "partition-wise join" in the first place, although I don't
> know what would be better, but we certainly need to avoid confusing a
> partition-wise join -- which is a join performed by joining each
> partition of one partitioned rel to the corresponding partition of a
> similarly partitioned rel rather than by the usual execution strategy
> of joining the parent rels -- with the concept of an other-join-rel,
> which an other-member-rel analogue for joins. I don't think the patch
> is currently very clear about this right now, either in the code or in
> the comments. Maybe this function ought to be named something like
> make_child_joins() or make_child_join_paths(), and we could use "child
> joins" and/or "child join paths" as standard terminology throughout
> the patch.

Partition-wise join is widely used term in the literature. Other
DBMSes use the same term as well. So, I think we should stick with
"partition-wise join". Partition-wise join as you have described is a
join performed by joining each partition of one partitioned rel to the
corresponding partition of a similarly partitioned rel rather than by
the usual execution strategy of joining the parent rels. I have
usually used the term "partition-wise join technique" to refer to this
method. I have changed the other usages of this term to use wording
like child joins or join between partiitions or join between child
relations as appropriate. Also, I have changed the names of functions
dealing with joins between partitions to use child_join instead of
partition_join or partition_wise_join.

Since partition-wise join is a method to join two relations just like
other methods, try_partition_wise_join() fits into the naming
convention try_<join technique> like try_nestloop_join.

> + rel1_desc = makeStringInfo();
> + rel2_desc = makeStringInfo();
> +
> + /* TODO: remove this notice when finalising the patch. */
> + outBitmapset(rel1_desc, rel1->relids);
> + outBitmapset(rel2_desc, rel2->relids);
> + elog(NOTICE, "join between relations %s and %s is considered for
> partition-wise join.",
> + rel1_desc->data, rel2_desc->data);
> Please remove your debugging cruft before submitting patches to
> pgsql-hackers, or at least put #ifdef NOT_USED or something around it.

I kept this one intentionally. But as the TODO comment says, I do
intend to remove it once testing is over. Those messages make it very
easy to know whether partition-wise join was considered for a given
join or not. Without those messages, one has to break into
try_partition_wise_join() to figure out whether partition-wise join
was used or not. The final plan may not come out to be partition-wise
join plan even if partition-wise join was considered. Although, I have
now used DEBUG3 instead of NOTICE and removed those lines from the
expected output.

> + * We allocate the array for child RelOptInfos till we find at least one
> + * join order which can use partition-wise join technique. If no join order
> + * can use partition-wise join technique, there are no child relations.
> This comment has problems. I think "till" is supposed to be "until",
> and there's supposed to be a "don't" in there somewhere. But really,
> I think what you're going for is just /* Allocate when first needed */
> which would be a lot shorter and also more clear.

Sorry for those mistakes. Yes, shorter version is better. Fixed the
comment as per your suggestion.

> + * Create join relations for the partition relations, if they do not exist
> + * already. Add paths to those for the given pair of joining relations.
> I think the comment could be a bit more explanatory here. Something
> like: "This joinrel is partitioned, so iterate over the partitions and
> create paths for each one, allowing us to eventually build an
> append-of-joins path for the parent. Since this routine may be called
> multiple times for various join orders, the RelOptInfo needed for each
> child join may or may not already exist, but the paths for this join
> order definitely do not. Note that we don't create any actual
> AppendPath at this stage; it only makes sense to do that at the end,
> after each possible join order has been considered for each child
> join. The best join order may differ from child to child."

Copied verbatim. Thanks for the detailed comment.

> + * partiticipating in the given partition relations. We need them
> Spelling.

Done. Also fixed other grammatical mistakes and typos in that comment.

> +/*
> + * Construct the SpecialJoinInfo for the partition-wise join using parents'
> + * special join info. Also, instead of
> + * constructing an sjinfo everytime, we should probably save it in
> + * root->join_info_list and search within it like join_is_legal?
> + */
> The lines here are of very different lengths for no particularly good
> reason, and it should end with a period, not a question mark.

My bad. Sorry. Fixed.

> On the substance of the issue, it seems like the way you're doing this
> right now could allocate a very large number of SpecialJoinInfo
> structures. For every join relation, you'll create one
> SpecialJoinInfo per legal join order per partition. That seems like
> it could get to be a big number. I don't know if that's going to be a
> problem from a memory-usage standpoint, but it seems like it might.
> It's not just the SpecialJoinInfo itself; all of the substructure gets
> duplicated, too.

Yes. We need the SpecialJoinInfo structures for the existing path
creation to work. The code will be complicated if we try to use parent
SpecialJoinInfo instead of creating those for children. We may free
memory allocated in SpecialJoinInfo to save some memory.
SpecialJoinInfos are not needed once the paths are created. Still we
will waste some memory for semi_rhs_exprs, which are reused for unique
paths. But otherwise we will reclaim the rest of the memory. Memory
wastage in adjust_partition_relids() may be minimized by modifying
adjust_appendrel_attrs() to accept list of AppendRelInfos and mutating
the tree only once rather than doing it N times for an N-way join.

> + SpecialJoinInfo *sjinfo = copyObject(parent_sjinfo);
> + sjinfo->min_lefthand = adjust_partition_relids(sjinfo->min_lefthand,
> + append_rel_infos1);
> Missing a blank line here.


> + AppendRelInfo *ari = lfirst(lc);
> Standard naming convention for an AppendRelInfo variable seems to be
> appinfo, not ari. (I just did "git grep AppendRelInfo".)


> + /* Skip non-equi-join clauses. */
> + if (!rinfo->can_join ||
> + rinfo->hashjoinoperator == InvalidOid ||
> + !rinfo->mergeopfamilies)
> + continue;
> There's definitely something ugly about this. If rinfo->can_join is
> false, then we're done. But suppose one of mergeopfamilies == NIL and
> rinfo->hashoperator == InvalidOid is true and the other is false. Are
> we really precluded from doing a partiion-wise join in that case, or
> are we just prohibited from using certain join strategies? In most
> places where we make similar tests, we're careful not to require more
> than we need.

Right. That condition is flawed. Corrected it.

> I also think that these tests need to consider the partitioning
> operator in use. Suppose that the partition key is of a type T which
> has two operator classes X and Y. Both relations are partitioned
> using an operator from opfamily X, but the join condition mentions
> opfamily Y. I'm pretty sure this precludes a partitionwise join. If
> the join condition used opfamily X, then we would have a guarantee
> that two rows which compared as equal would be in the same partition,
> but because it uses opfamily Y, that's not guaranteed. For example,
> if T is a text type, X might test for exact equality using "C"
> collation rules, while Y might test for equality using some
> case-insensitive set of rules. If the partition boundaries are such
> that "foo" and "FOO" are in different partitions, a partitionwise join
> using the case-insensitive operator will produce wrong results. You
> can also imagine this happening with numeric, if you have one opclass
> (like the default one) that considers 5.0 and 5.00 to be equal, but
> another opclass that thinks they are different; if the latter is used
> to set the partition bounds, 5.0 and 5.00 could end up in different
> partitions - which will be fine if an operator from that opclass is
> used for the join, but not if an operator from the regular opclass is
> used.

Your description above uses opfamily and opclass interchangeably. It
starts saying X and Y are classed but then also refers to them as
families. But I got the point. I guess, similar to
relation_has_unique_index_for(), I have to check whether the operator
family specified in the partition scheme is present in the
mergeopfamilies in RestrictInfo for matching partition key. I have
added that check and restructured that portion of code to be readable.

> After thinking this over a bit, I think the right way to think about this is:
> 1. Amit's patch currently only ever uses btree opfamilies for
> partitioning. It uses those for both range partitioning and list
> partitioning. If we ever support hash partitioning, we would
> presumably use hash opfamilies for that purpose, but right now it's
> all about btree opfamilies.
> 2. Therefore, if A and B are partitioned but the btree opfamilies
> don't match, they don't have the same partitioning scheme and this
> code should never be reached. Similarly, if they use the same
> opfamily but different collations, the partitioning schemes shouldn't
> match and therefore this code should not be reached.

That's right.

> 3. If A and B are partitioned and the partitioning opfamilies - which
> are necessarily btree opfamilies - do match, then the operator which
> appears in the query needs to be from the same opfamily and have
> amopstrategy of BTEqualStrategyNumber within that opfamily. If not,
> then a partition-wise join is not possible.

I think this is achieved by checking whether the opfamily for given
partition key is present in the mergeopfamilies of corresponding
RestrictInfo, as stated above.

> 4. Assuming the above conditions are met, have_partkey_equi_join
> doesn't need to care whether the operator chosen has mergeopfamilies
> or a valid hashjoinoperator. Those factors will control which join
> methods are legal, but not whether a partitionwise join is possible in
> principle.

If mergeopfamilies is NIL, above check will fail anyway. But skipping
a clause which has mergeopfamilies NIL will save some cycles in
matching expressions.

There is something strange happening with Amit's patch. When we create
a table partitioned by range on a column of type int2vector, it
somehow gets a btree operator family, but doesn't have mergeopfamilies
set in RestrictInfo of equality condition on that column. Instead the
RestrictInfo has hashjoinoperator. In this case if we ignore
hashjoinoperator, we won't be able to apply partition-wise join. I
guess, in such case we want to play safe and not apply partition-wise
join, even though applying it will give the correct result.

> + * RelabelType node; eval_const_expressions() will have simplied if more
> Spelling.

Thanks. Done.

> /*
> + * Code below scores equivalence classes by how many equivalence members
> + * can produce join clauses for this join relation. Equivalence members
> + * which do not cover the parents of a partition-wise join relation, can
> + * produce join clauses for partition-wise join relation.
> + */
> I don't know what that means. The comma in the second sentence
> doesn't belong there.

Sorry for that construction. I have changed the comment to be
something more meaningful.

> + /*
> + * TODO: Instead of copying and mutating the trees one child relation at a
> + * time, we should be able to do this en-masse for all the partitions
> + * involved.
> + */
> I don't see how that would be possible, but if it's a TODO, you'd
> better do it (or decide not to do it and remove or change the
> comment).

That should be doable by passing a list of AppendRelInfo structures to
adjust_appendrel_attrs_mutator(). In the mutator, we have to check
each appinfo instead of just one. But that's a lot of refactoring. May
be done as a separate patch, if we are consuming too much memory. I
have removed TODO for now.

> /*
> * Create explicit sort nodes for the outer and inner paths if necessary.
> */
> if (best_path->outersortkeys)
> {
> + Relids outer_relids = outer_path->parent->relids;
> Sort *sort = make_sort_from_pathkeys(outer_plan,
> - best_path->outersortkeys);
> + best_path->outersortkeys,
> + outer_relids);
> The changes related to make_sort_from_pathkeys() are pretty opaque to
> me. Can you explain?

prepare_sort_from_pathkeys() accepts Relids as one of the argument to
find equivalence members belonging to child relations. The function
does not expect relids when searching equivalence members for parent
relations. Before this patch, make_sort_from_pathkeys() passed NULL to
this function, because it didn't expect child relations before.
Because of partition-wise joins, we need to sort child relations for
merge join or to create unique paths. So, make_sort_from_pathkeys() is
required to pass relids to prepare_sort_from_pathkeys() when
processing child relations, so that the later does not skip child

> + * Change parameterization of sub paths recursively. Also carry out any
> "sub paths" should not be two words, here or anywhere.


> +reparameterize_path_for_child(PlannerInfo *root, Path *path,
> + RelOptInfo *child_rel)
> This is suspiciously unlike reparameterize_path. Why?

reparameterize_path() tries to create path with new parameterization
from an existing parameterized path. So, it looks for additional
conditions to expand the parameterization. But this functions
translates a path parameterized by parent to be parameterized by its
child. That does not involve looking for any extra conditions, but
involves translating the existing ones so that they can be used with a
child. A right name would be translate_parampath_to_child() or
something which uses word "translate" instead of "reparameterize". But
every name like that is getting too long. For now I have renamed it as
reparameterize_path_by_child(). Also added a comment in the function
prologue about cost, rows, width etc.

> + /* Computer information relevant to the foreign relations. */
> + set_foreign_rel_properties(joinrel, outer_rel, inner_rel);
> Perhaps this refactoring could be split out into a preliminary patch,
> which would then simplify this patch. And same for add_join_rel().

Yes, that's better. I will separate the code out in a separate patch.

There's code in build_join_rel() and build_partition_join_rel() (I
will change that name) which creates a joinrel RelOptInfo. Most of
that code simply sets NULL or 0 fields and is duplicated in both the
functions. Do you see any value in separating it out in its own

Also, makeNode() uses palloc0(), thus makeNode(RelOptInfo) would set
most of the fields to 0 or NULL. Why do we then again set those fields
as NULL or 0? Should I try to remove unnecessary assignments?

> + * Produce partition-wise joinrel's targetlist by translating the parent
> + * joinrel's targetlist. This will also include the required placeholder
> Again the confusion between a "child" join and a partition-wise join...
> + /*
> + * Nothing to do if
> + * a. partition-wise join is disabled.
> + * b. joining relations are not partitioned.
> + * c. partitioning schemes do not match.
> + */
> +
> I don't think that's going to survive pgindent.

Changed this code a bit.

> + * are not considered equal, an equi-join involing inner partition keys
> Spelling.
> + * Collect the partition key expressions. An OUTER join will produce rows
> + * where the partition key columns of inner side are NULL and may not fit
> + * the partitioning scheme with inner partition keys. Since two NULL values
> + * are not considered equal, an equi-join involing inner partition keys
> + * still prohibits cross-partition joins while joining with another
> + * similarly partitioned relation.
> I can't figure out what this comment is trying to tell me. Possibly I
> just need more caffeine.

Re-wrote the comment with examples and detailed explanation. The
comment talks about whether inner partition key expressions should be
considered as the partition key expressions for the join, given that
for an OUTER join the inner partition key expressions may go NULL. The
comment explains why it's safe to do so. If we don't do that, any FULL
OUTER join will have no partition expressions and thus partition-wise
join technique will be useless for a N-way FULL OUTER join even if
it's safe to use it.

> + * Adding these two join_rel_level list also means that top level list has more
> + * than one join relation, which is symantically incorrect.
> I don't understand this, either; also, spelling.

I think, that sentence is not required. Removed it.

> As a general comment, the ratio of tests-to-code in this patch is way
> out of line with PostgreSQL's normal practices. The total patch file
> is 10965 lines. The test cases begin at line 3047, meaning that in
> round figures you've got about one-quarter code and about
> three-quarters test cases. I suspect that a large fraction of those
> test cases aren't adding any meaningful code coverage and will just
> take work to maintain. That needs to be slimmed down substantially in
> any version of this considered for commit.

I agree. We require two kinds of tests 1. those which test partition
scheme matching 2. those test the planner code, which deals with path
creation. I have added both kinds of testcases for all kinds of
partitioning schemes (range, list, multi-level, partition key being
expressions, columns). That's not required. We need 1st kind of tests
for all partitioning schemes and 2nd kind of testcases only for one of
the partitioning schemes. So, definitely the number of tests will
reduce. A possible extreme would be to use a single multi-level
partitioned tests, which includes all kinds of partitioning schemes at
various partition levels. But that kind of testcase will be highly
unreadable and harder to maintain. Let me know what do you think. I
will work on that in the next version of patch. The test still fails
because of a bug in Amit's earlier set of patches

Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment Content-Type Size
pg_dp_join_v4.patch text/plain 512.7 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2016-10-14 04:38:06 Re: Renaming of pg_xlog and pg_clog
Previous Message Peter Eisentraut 2016-10-14 04:23:57 Re: Re: [COMMITTERS] pgsql: Extend framework from commit 53be0b1ad to report latch waits.