Re: Partition-wise join for join between (declaratively) partitioned tables

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Partition-wise join for join between (declaratively) partitioned tables
Date: 2017-03-15 12:55:22
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

On Wed, Mar 15, 2017 at 6:51 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Mar 14, 2017 at 8:33 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> Of course, that supposes that 0009 can manage to postpone creating
>> non-sampled child joinrels until create_partition_join_plan(), which
>> it currently doesn't. In fact, unless I'm missing something, 0009
>> hasn't been even slightly adapted to take advantage of the
>> infrastructure in 0001; it doesn't seem to reset the path_cxt or
>> anything. That seems like a fairly major omission.
> Some other comments on 0009:
> Documentation changes for the new GUCs are missing.

Done. The description might need more massaging, but I will work on
that once we have fixed their names and usage. I think
sample_partition_fraction and partition_wise_plan_weight, if retained,
will be applicable to other partition-wise planning like
partition-wise aggregates. So we will need more generic description

> +between the partition keys of the joining tables. The equi-join between
> +partition keys implies that for a given row in a given partition of a given
> +partitioned table, its joining row, if exists, should exist only in the
> +matching partition of the other partitioned table; no row from non-matching
> There could be more than one. I'd write: The equi-join between
> partition keys implies that all join partners for a given row in one
> partitioned table must be in the corresponding partition of the other
> partitioned table.

Done. I think it's important to emphasize the the joining partners can
not be in other partitions. So, added that sentence after your
suggested sentence.

> +#include "miscadmin.h"
> #include <limits.h>
> #include <math.h>
> Added in wrong place.


> + * System attributes do not need
> translation. In such a case,
> + * the attribute numbers of the parent
> and the child should
> + * start from the same minimum attribute.
> I would delete the second sentence and add an Assert() to that effect instead.

The assertion is there just few lines down. Please let me know if that
suffices. Deleted the second sentence.

> + /* Pass top parent's relids down the inheritance hierarchy. */
> Why?

That is required for a multi-level partitioned table.
top_parent_relids are used for translating expressions of the top
parent to that of child table.

> + for (attno = rel->min_attr; attno <=
> rel->max_attr; attno++)
> Add add a comment explaining why we need to do this.

The comment is there just few lines above. I have moved it just above
this for loop.

> - add_paths_to_append_rel(root, rel, live_childrels);
> + add_paths_to_append_rel(root, rel, live_childrels, false);
> }
> -
> No need to remove blank line.

Sorry. That was added by my patch to refactor
set_append_rel_pathlist(). I have added a patch in the series to
remove that line.

> + * When called on partitioned join relation with partition_join_path = true, it
> + * adds PartitionJoinPath instead of Merge/Append path. This path is costed
> + * based on the costs of sampled child-join and is expanded later into
> + * Merge/Append plan.
> I'm not a big fan of the Merge/Append terminology here. If somebody
> adds another kind of append-path someday, then all of these comments
> will have to be updated. I think this can be phrased more
> generically.

Reworded as
+ * When partition_join_path is true, the caller intends to add a
+ * PartitionJoinPath costed based on the sampled child-joins passed as
+ * live_childrels.

Also added an assertion to make sure the partition_join_path is true
only for join relations.

> /*
> + * While creating PartitionJoinPath, we sample paths from only
> a few child
> + * relations. Even if all of sampled children have partial
> paths, it's not
> + * guaranteed that all the unsampled children will have partial paths.
> + * Hence we do not create partial PartitionJoinPaths.
> + */
> Very sad. I guess if we had parallel append available, we could maybe
> dodge this problem, but for now I suppose we're stuck with it.

Really sad. Is there a way to look at the relation (without any
partial paths yet) and see whether the relation will have partial
paths or not. Even if we don't have actual partial paths but know that
there will be at least one added in the future, we will be able to fix
this problem.

> + /*
> + * Partitioning scheme in join relation indicates a possibility that the
> + * join may be partitioned, but it's not necessary that every pair of
> + * joining relations can use partition-wise join technique. If one of
> + * joining relations turns out to be unpartitioned, this pair of joining
> + * relations can not use partition-wise join technique.
> + */
> + if (!rel1->part_scheme || !rel2->part_scheme)
> + return;
> How can this happen? If rel->part_scheme != NULL, doesn't that imply
> that every rel covered by the joinrel is partitioned that way, and
> therefore this condition must necessarily hold?

I don't remember exactly, but this was added considering a more
generic partition-wise join. But then we would have more changes when
we support that. So, turned this into an assertion.

> In general, I think it's better style to write explicit tests against
> NULL or NIL than to just write if (blahptr).

PG code uses both the styles. Take for example
src/backend/rewrite/rewriteManip.c or any file, both styles are being
used. I find this style useful, when I want to code, say "if this
relation does not have a partitioning scheme" rather than "if this
relation have NULL partitioning scheme". Although I don't have
objections changing it as per your suggestion.

> + partitioned_join->sjinfo = copyObject(parent_sjinfo);
> Why do we need to copy it?

sjinfo in make_join_rel() may be from root->join_info_list or it could
be one made up locally in that function. The one made up in that
function would go away with that function, whereas we need it much
later to create paths for child-joins. So, I thought it's better to
copy it. But now I have changed to code to pass NULL for a made-up
sjinfo. In such a case, the child-join's sjinfo is also made up. This
required some refactoring to separate out the making-up code. So,
there's new refactoring patch.

> + /*
> + * Remove the relabel decoration. We can assume that there is
> at most one
> + * RelabelType node; eval_const_expressions() simplifies multiple
> + * RelabelType nodes into one.
> + */
> + if (IsA(expr, RelabelType))
> + expr = (Expr *) ((RelabelType *) expr)->arg;
> Still, instead of assuming this, you could just s/if/while/, and then
> you wouldn't need the assumption any more. Also, consider castNode().


> partition_wise_plan_weight may be useful for testing, but I don't
> think it should be present in the final patch.

partition_join test needs it so that it can work with smaller dataset
and complete faster. For smaller data sets the partition-wise join
paths come out to be costlier than other kinds and are never chosen.
By setting partition_wise_plan_weight I can force partition-wise join
to be chosen. An alternate solution would be to use
sample_partition_fraction = 1.0, but then we will never test delayed
planning for unsampled child-joins. I also think that users will find
partition_wise_plan_weight useful when estimates based on samples are
unrealistic. Obviously, in a longer run we should be able to provide
better estimates.

Apart from this, I have also removed recursive calls to
try_partition_wise_join() and generate_partition_wise_join_paths()
from 0009 and places them in the 0014 patch. Those are required for
multi-level partitioned tables, which are not supported in 0009.

Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment Content-Type Size application/zip 88.8 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2017-03-15 13:18:45 Re: Write Ahead Logging for Hash Indexes
Previous Message Ashutosh Bapat 2017-03-15 12:49:25 Re: Partition-wise join for join between (declaratively) partitioned tables