Re: Partition-wise join for join between (declaratively) partitioned tables

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Partition-wise join for join between (declaratively) partitioned tables
Date: 2017-02-09 10:50:26
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

Per your suggestion I have split the patch into many smaller patches.


These four refactor existing code.

0007-Partition-wise-join-tests.patch -- just tests, they fail
0008-Partition-wise-join.patch -- actual patch implementing
partition-wise join, still some tests fail\


The last three patches change existing code to expect child(-join)
relations where they were not expected earlier.

Each patch has summary of the changes.

Partition-wise join for multi-level partitioned tables is not covered
by these patches. I will post those patches soon.

>> Other questions/comments:
>> Why does find_partition_scheme need to copy the partition bound
>> information instead of just pointing to it? Amit went to some trouble
>> to make sure that this can't change under us while we hold a lock on
>> the relation, and we'd better hold a lock on the relation if we're
>> planning a query against it.
> PartitionScheme is shared across multiple relations, join or base,
> partitioned similarly. Obviously it can't and does not need to point
> partition bound informations (which should all be same) of all those
> base relations. O the the face of it, it looks weird that it points to
> only one of them, mostly the one which it encounters first. But, since
> it's going to be the same partition bound information, it doesn't
> matter which one. So, I think, we can point of any one of those. Do
> you agree?

Instead of copying PartitionBoundInfo, used pointer of the first
encountered one.

>> I think the PartitionScheme stuff should live in the optimizer rather
>> that src/backend/catalog/partition.c. Maybe plancat.c? Perhaps we
>> eventually need a new file in the optimizer just for partitioning
>> stuff, but I'm not sure about that yet.
> I placed PartitionScheme stuff in partition.c because most of the
> functions and structures in partition.c are not visible outside that
> file. But I will try again to locate PartitionScheme to optimizer.

Moved the code as per your suggestion.

>> The fact that set_append_rel_size needs to reopen the relation to
>> extract a few more bits of information is not desirable. You need to
>> fish this information through in some other way; for example, you
>> could have get_relation_info() stash the needed bits in the
>> RelOptInfo.
> I considered this option and discarded it, since not all partitioned
> relations will have OIDs for partitions e.g. partitioned joins will
> not have OIDs for their partitions. But now that I think of it, we
> should probably store those OIDs just for the base relation and leave
> them unused for non-base relations just like other base relation
> specific fields in RelOptInfo.

Changed as per your suggestions.

>> + * For two partitioned tables with the same
>> partitioning scheme, it is
>> + * assumed that the Oids of matching partitions from
>> both the tables
>> + * are placed at the same position in the array of
>> partition oids in
>> Rather than saying that we assume this, you should say why it has to
>> be true. (If it doesn't have to be true, we shouldn't assume it.)
> Will take care of this.

Done. Please check.

>> + * join relations. Partition tables should have same
>> layout as the
>> + * parent table and hence should not need any
>> translation. But rest of
>> The same attributes have to be present with the same types, but they
>> can be rearranged. This comment seems to imply the contrary.
> Hmm, will take care of this.


>> FRACTION_PARTS_TO_PLAN seems like it should be a GUC.
> +1. Will take care of this. Does "representative_partitions_fraction"
> or "sample_partition_fraction" look like a good GUC name? Any other
> suggestions?

used "sample_partition_fraction" for now. Suggestions are welcome.

>> + /*
>> + * Add this relation to the list of samples ordered by
>> the increasing
>> + * number of rows at appropriate place.
>> + */
>> + foreach (lc, ordered_child_nos)
>> + {
>> + int child_no = lfirst_int(lc);
>> + RelOptInfo *other_childrel = rel->part_rels[child_no];
>> +
>> + /*
>> + * Keep track of child with lowest number of
>> rows but higher than the
>> + * that of the child being inserted. Insert
>> the child before a
>> + * child with highest number of rows lesser than it.
>> + */
>> + if (child_rel->rows <= other_childrel->rows)
>> + insert_after = lc;
>> + else
>> + break;
>> + }
>> Can we use quicksort instead of a hand-coded insertion sort?
> I guess so, if I write comparison functions, which shouldn't be a
> problem. Will try that.


>> + if (bms_num_members(outer_relids) > 1)
>> Seems like bms_get_singleton_member could be used.

That code is not required any more.

>> + * Partitioning scheme in join relation indicates a possibilty that the
>> Spelling.


>> There seems to be no reason for create_partition_plan to be separated
>> from create_plan_recurse. You can just add another case for the new
>> path type.


>> Why does create_partition_join_path need to be separate from
>> create_partition_join_path_with_pathkeys? Couldn't that be combined
>> into a single function with a pathkeys argument that might sometimes
>> be NIL? I assume most of the logic is common.

Combined those into a single function.

>> From a sort of theoretical standpoint, the biggest danger of this
>> patch seems to be that by deferring path creation until a later stage
>> than normal, we could miss some important processing.
>> subquery_planner() does a lot of stuff after
>> expand_inherited_tables(); if any of those things, especially the ones
>> that happen AFTER path generation, have an effect on the paths, then
>> this code needs to compensate for those changes somehow. It seems
>> like having the planning of unsampled children get deferred until
>> create_plan() time is awfully surprising; here we are creating the
>> plan and suddenly what used to be a straightforward path->plan
>> translation is running around doing major planning work. I can't
>> entirely justify it, but I somehow have a feeling that work ought to
>> be moved earlier. Not sure exactly where.

Pasting my previous replies here to keep everything in one mail.

I agree with this. Probably we should add a path tree mutator before
SS_identify_outer_params() to replace any Partition*Paths with
Merge/Append paths. The mutator will create paths for child-joins
within temporary memory context, copy the relevant paths and create
Merge/Append paths. There are two problems there 1. We have to write
code to copy paths; most of the paths would be flat copy but custom
scan paths might have some unexpected problems. 2. There will be many
surviving PartitionPaths, and all the corresponding child paths would
need copying and consume memory. In order to reduce that consumption,
we have run this mutator after set_cheapest() in subquery_planner();
but then nothing interesting happens between that and create_plan().
Expanding PartitionPaths during create_plan() does not need any path
copying and we expand only the PartitionPaths which will be converted
to plans. That does save a lot of memory; the reason why we defer
creating paths for child-joins.

>> This is not really a full review, mostly because I can't easily figure
>> out the motivation for all of the changes the patch makes. It makes a
>> lot of changes in a lot of places, and it's not really very easy to
>> understand why those changes are necessary. My comments above about
>> splitting the patch into a series of patches that can potentially be
>> reviewed and applied independently, with the main patch being the last
>> in the series, are a suggestion as to how to tackle that. There might
>> be some work that needs to or could be done on the comments, too. For
>> example, the patch splits out add_paths_to_append_rel from
>> set_append_rel_pathlist, but the comments don't say anything helpful
>> like "we need to do X after Y, because Z". They just say that we do
>> it. To some extent I think the comments in the optimizer have that
>> problem generally, so it's not entirely the fault of this patch;
>> still, the lack of those explanations makes the code reorganization
>> harder to follow, and might confuse future patch authors, too.

Specifically about add_paths_to_append_rel(), what do you expect the
comment to say? It would be obvious why we split that functionality
into a separate function: in fact, we don't necessarily explain why
certain code resides in a separate function in the comments. I think,
that particular comment (or for that matter other such comments in the
optimizer) can be removed altogether, since it just writes the
function names as an "English" sentence. I sometimes find those
comments useful, because I can read just those comments and forget
about the code, making comprehension easy. If highlighting is ON, your
brain habitually ignores the non-comment portions when required. I am
open to suggestions.

Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment Content-Type Size
0001-Refactor-set_append_rel_pathlist.patch application/octet-stream 4.0 KB
0002-Refactor-make_join_rel.patch application/octet-stream 2.4 KB
0003-Refactor-adjust_appendrel_attrs.patch application/octet-stream 12.8 KB
0004-Refactor-build_join_rel.patch application/octet-stream 6.7 KB
0005-Add-function-find_param_path_info.patch application/octet-stream 3.6 KB
0006-Canonical-partition-scheme.patch application/octet-stream 16.0 KB
0007-Partition-wise-join-tests.patch application/octet-stream 237.7 KB
0008-Partition-wise-join.patch application/octet-stream 83.6 KB
0009-Adjust-join-related-to-code-to-accept-child-relation.patch application/octet-stream 11.3 KB
0010-Parameterized-path-fixes.patch application/octet-stream 16.3 KB
0011-Use-IS_JOIN_REL-instead-of-RELOPT_JOINREL.patch application/octet-stream 5.8 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2017-02-09 11:01:25 Latch reset ordering bug in condition_variable.c
Previous Message Amit Kapila 2017-02-09 10:40:18 Re: Proposal : For Auto-Prewarm.