Re: Partition-wise join for join between (declaratively) partitioned tables

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Partition-wise join for join between (declaratively) partitioned tables
Date: 2017-02-10 05:49:09
Message-ID: CAFjFpRfdv2HFv7yB9ER+Qc3pKLT1Wfi+mAoPf2z4E-s3DE+2eA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Fixed a problem with the way qsort was being used in the earlier set
of patches. Attached PFA the set of patches with that fixed.

On Thu, Feb 9, 2017 at 4:20 PM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> Per your suggestion I have split the patch into many smaller patches.
>
> 0001-Refactor-set_append_rel_pathlist.patch
> 0002-Refactor-make_join_rel.patch
> 0003-Refactor-adjust_appendrel_attrs.patch
> 0004-Refactor-build_join_rel.patch
> 0005-Add-function-find_param_path_info.patch
>
> These four refactor existing code.
>
> 0006-Canonical-partition-scheme.patch
> 0007-Partition-wise-join-tests.patch -- just tests, they fail
> 0008-Partition-wise-join.patch -- actual patch implementing
> partition-wise join, still some tests fail\
>
> 0009-Adjust-join-related-to-code-to-accept-child-relation.patch
> 0010-Parameterized-path-fixes.patch
> 0011-Use-IS_JOIN_REL-instead-of-RELOPT_JOINREL.patch
>
> The last three patches change existing code to expect child(-join)
> relations where they were not expected earlier.
>
> Each patch has summary of the changes.
>
> Partition-wise join for multi-level partitioned tables is not covered
> by these patches. I will post those patches soon.
>
>>
>>>
>>> Other questions/comments:
>>>
>>> Why does find_partition_scheme need to copy the partition bound
>>> information instead of just pointing to it? Amit went to some trouble
>>> to make sure that this can't change under us while we hold a lock on
>>> the relation, and we'd better hold a lock on the relation if we're
>>> planning a query against it.
>>
>> PartitionScheme is shared across multiple relations, join or base,
>> partitioned similarly. Obviously it can't and does not need to point
>> partition bound informations (which should all be same) of all those
>> base relations. O the the face of it, it looks weird that it points to
>> only one of them, mostly the one which it encounters first. But, since
>> it's going to be the same partition bound information, it doesn't
>> matter which one. So, I think, we can point of any one of those. Do
>> you agree?
>
> Instead of copying PartitionBoundInfo, used pointer of the first
> encountered one.
>
>>
>>>
>>> I think the PartitionScheme stuff should live in the optimizer rather
>>> that src/backend/catalog/partition.c. Maybe plancat.c? Perhaps we
>>> eventually need a new file in the optimizer just for partitioning
>>> stuff, but I'm not sure about that yet.
>>
>> I placed PartitionScheme stuff in partition.c because most of the
>> functions and structures in partition.c are not visible outside that
>> file. But I will try again to locate PartitionScheme to optimizer.
>
> Moved the code as per your suggestion.
>
>>
>>>
>>> The fact that set_append_rel_size needs to reopen the relation to
>>> extract a few more bits of information is not desirable. You need to
>>> fish this information through in some other way; for example, you
>>> could have get_relation_info() stash the needed bits in the
>>> RelOptInfo.
>>
>> I considered this option and discarded it, since not all partitioned
>> relations will have OIDs for partitions e.g. partitioned joins will
>> not have OIDs for their partitions. But now that I think of it, we
>> should probably store those OIDs just for the base relation and leave
>> them unused for non-base relations just like other base relation
>> specific fields in RelOptInfo.
>
> Changed as per your suggestions.
>
>>
>>>
>>> + * For two partitioned tables with the same
>>> partitioning scheme, it is
>>> + * assumed that the Oids of matching partitions from
>>> both the tables
>>> + * are placed at the same position in the array of
>>> partition oids in
>>>
>>> Rather than saying that we assume this, you should say why it has to
>>> be true. (If it doesn't have to be true, we shouldn't assume it.)
>>
>> Will take care of this.
>
> Done. Please check.
>
>>
>>>
>>> + * join relations. Partition tables should have same
>>> layout as the
>>> + * parent table and hence should not need any
>>> translation. But rest of
>>>
>>> The same attributes have to be present with the same types, but they
>>> can be rearranged. This comment seems to imply the contrary.
>>
>> Hmm, will take care of this.
>
> Done.
>
>>
>>>
>>> FRACTION_PARTS_TO_PLAN seems like it should be a GUC.
>>
>> +1. Will take care of this. Does "representative_partitions_fraction"
>> or "sample_partition_fraction" look like a good GUC name? Any other
>> suggestions?
>
> used "sample_partition_fraction" for now. Suggestions are welcome.
>
>>
>>>
>>> + /*
>>> + * Add this relation to the list of samples ordered by
>>> the increasing
>>> + * number of rows at appropriate place.
>>> + */
>>> + foreach (lc, ordered_child_nos)
>>> + {
>>> + int child_no = lfirst_int(lc);
>>> + RelOptInfo *other_childrel = rel->part_rels[child_no];
>>> +
>>> + /*
>>> + * Keep track of child with lowest number of
>>> rows but higher than the
>>> + * that of the child being inserted. Insert
>>> the child before a
>>> + * child with highest number of rows lesser than it.
>>> + */
>>> + if (child_rel->rows <= other_childrel->rows)
>>> + insert_after = lc;
>>> + else
>>> + break;
>>> + }
>>>
>>> Can we use quicksort instead of a hand-coded insertion sort?
>>
>> I guess so, if I write comparison functions, which shouldn't be a
>> problem. Will try that.
>
> Done.
>
>>
>>>
>>> + if (bms_num_members(outer_relids) > 1)
>>>
>>> Seems like bms_get_singleton_member could be used.
>
> That code is not required any more.
>
>>>
>>> + * Partitioning scheme in join relation indicates a possibilty that the
>>>
>>> Spelling.
>
> Done.
>
>>>
>>> There seems to be no reason for create_partition_plan to be separated
>>> from create_plan_recurse. You can just add another case for the new
>>> path type.
>
> Done.
>
>>>
>>> Why does create_partition_join_path need to be separate from
>>> create_partition_join_path_with_pathkeys? Couldn't that be combined
>>> into a single function with a pathkeys argument that might sometimes
>>> be NIL? I assume most of the logic is common.
>
> Combined those into a single function.
>
>>>
>>> From a sort of theoretical standpoint, the biggest danger of this
>>> patch seems to be that by deferring path creation until a later stage
>>> than normal, we could miss some important processing.
>>> subquery_planner() does a lot of stuff after
>>> expand_inherited_tables(); if any of those things, especially the ones
>>> that happen AFTER path generation, have an effect on the paths, then
>>> this code needs to compensate for those changes somehow. It seems
>>> like having the planning of unsampled children get deferred until
>>> create_plan() time is awfully surprising; here we are creating the
>>> plan and suddenly what used to be a straightforward path->plan
>>> translation is running around doing major planning work. I can't
>>> entirely justify it, but I somehow have a feeling that work ought to
>>> be moved earlier. Not sure exactly where.
>
> Pasting my previous replies here to keep everything in one mail.
>
> I agree with this. Probably we should add a path tree mutator before
> SS_identify_outer_params() to replace any Partition*Paths with
> Merge/Append paths. The mutator will create paths for child-joins
> within temporary memory context, copy the relevant paths and create
> Merge/Append paths. There are two problems there 1. We have to write
> code to copy paths; most of the paths would be flat copy but custom
> scan paths might have some unexpected problems. 2. There will be many
> surviving PartitionPaths, and all the corresponding child paths would
> need copying and consume memory. In order to reduce that consumption,
> we have run this mutator after set_cheapest() in subquery_planner();
> but then nothing interesting happens between that and create_plan().
> Expanding PartitionPaths during create_plan() does not need any path
> copying and we expand only the PartitionPaths which will be converted
> to plans. That does save a lot of memory; the reason why we defer
> creating paths for child-joins.
>
>>>
>>> This is not really a full review, mostly because I can't easily figure
>>> out the motivation for all of the changes the patch makes. It makes a
>>> lot of changes in a lot of places, and it's not really very easy to
>>> understand why those changes are necessary. My comments above about
>>> splitting the patch into a series of patches that can potentially be
>>> reviewed and applied independently, with the main patch being the last
>>> in the series, are a suggestion as to how to tackle that. There might
>>> be some work that needs to or could be done on the comments, too. For
>>> example, the patch splits out add_paths_to_append_rel from
>>> set_append_rel_pathlist, but the comments don't say anything helpful
>>> like "we need to do X after Y, because Z". They just say that we do
>>> it. To some extent I think the comments in the optimizer have that
>>> problem generally, so it's not entirely the fault of this patch;
>>> still, the lack of those explanations makes the code reorganization
>>> harder to follow, and might confuse future patch authors, too.
>>>
>
> Specifically about add_paths_to_append_rel(), what do you expect the
> comment to say? It would be obvious why we split that functionality
> into a separate function: in fact, we don't necessarily explain why
> certain code resides in a separate function in the comments. I think,
> that particular comment (or for that matter other such comments in the
> optimizer) can be removed altogether, since it just writes the
> function names as an "English" sentence. I sometimes find those
> comments useful, because I can read just those comments and forget
> about the code, making comprehension easy. If highlighting is ON, your
> brain habitually ignores the non-comment portions when required. I am
> open to suggestions.
>
>
>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment Content-Type Size
0001-Refactor-set_append_rel_pathlist.patch application/octet-stream 4.0 KB
0002-Refactor-make_join_rel.patch application/octet-stream 2.4 KB
0003-Refactor-adjust_appendrel_attrs.patch application/octet-stream 12.8 KB
0004-Refactor-build_join_rel.patch application/octet-stream 6.7 KB
0005-Add-function-find_param_path_info.patch application/octet-stream 3.6 KB
0006-Canonical-partition-scheme.patch application/octet-stream 16.0 KB
0007-Partition-wise-join-tests.patch application/octet-stream 237.7 KB
0008-Partition-wise-join.patch application/octet-stream 84.3 KB
0009-Adjust-join-related-to-code-to-accept-child-relation.patch application/octet-stream 11.3 KB
0010-Parameterized-path-fixes.patch application/octet-stream 16.3 KB
0011-Use-IS_JOIN_REL-instead-of-RELOPT_JOINREL.patch application/octet-stream 5.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2017-02-10 05:51:54 Re: Removal of deprecated views pg_user, pg_group, pg_shadow
Previous Message Michael Paquier 2017-02-10 05:32:14 Re: contrib modules and relkind check