Quick Links

Re: Partition-wise join for join between (declaratively) partitioned tables

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-02-06 10:04:37
Message-ID:	CAFjFpRdrQFsqbWDVjbqUNr1MNr_GjryOTO3YmVd4eqrveRGaRQ@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Thu, Feb 2, 2017 at 2:41 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mon, Jan 2, 2017 at 7:32 AM, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> PFA the patch (pg_dp_join_v6.patch) with some bugs fixed and rebased
>> on the latest code.
>
> Maybe not surprisingly given how fast things are moving around here
> these days, this needs a rebase.
>
> Apart from that, my overall comment on this patch is that it's huge:
>
> 37 files changed, 7993 insertions(+), 287 deletions(-)
>
> Now, more than half of that is regression test cases and their output,
> which you will certainly be asked to pare down in any version of this
> intended for commit.

Yes. I will work on that once the design and implementation is in
acceptable state. I have already toned down testcases compared to the
previous patch.

> But even excluding those, it's still a fairly
> patch:
>
> 30 files changed, 2783 insertions(+), 272 deletions(-)
>
> I think the reason this is so large is because there's a fair amount
> of refactoring work that has been done as a precondition of the actual
> meat of the patch, and no attempt has been made to separate the
> refactoring work from the main body of the patch. I think that's
> something that needs to be done. If you look at the way Amit Langote
> submitted the partitioning patches and the follow-up bug fixes, he had
> a series of patches 0001-blah, 0002-quux, etc. generated using
> format-patch. Each patch had its own commit message written by him
> explaining the purpose of that patch, links to relevant discussion,
> etc. If you can separate this into more digestible chunks it will be
> easier to get committed.

I will try to break down the patch into smaller, easy-to-review,
logically cohesive patches.

>
> Other questions/comments:
>
> Why does find_partition_scheme need to copy the partition bound
> information instead of just pointing to it? Amit went to some trouble
> to make sure that this can't change under us while we hold a lock on
> the relation, and we'd better hold a lock on the relation if we're
> planning a query against it.

PartitionScheme is shared across multiple relations, join or base,
partitioned similarly. Obviously it can't and does not need to point
partition bound informations (which should all be same) of all those
base relations. O the the face of it, it looks weird that it points to
only one of them, mostly the one which it encounters first. But, since
it's going to be the same partition bound information, it doesn't
matter which one. So, I think, we can point of any one of those. Do
you agree?

>
> I think the PartitionScheme stuff should live in the optimizer rather
> that src/backend/catalog/partition.c. Maybe plancat.c? Perhaps we
> eventually need a new file in the optimizer just for partitioning
> stuff, but I'm not sure about that yet.

I placed PartitionScheme stuff in partition.c because most of the
functions and structures in partition.c are not visible outside that
file. But I will try again to locate PartitionScheme to optimizer.

>
> The fact that set_append_rel_size needs to reopen the relation to
> extract a few more bits of information is not desirable. You need to
> fish this information through in some other way; for example, you
> could have get_relation_info() stash the needed bits in the
> RelOptInfo.

I considered this option and discarded it, since not all partitioned
relations will have OIDs for partitions e.g. partitioned joins will
not have OIDs for their partitions. But now that I think of it, we
should probably store those OIDs just for the base relation and leave
them unused for non-base relations just like other base relation
specific fields in RelOptInfo.

>
> + * For two partitioned tables with the same
> partitioning scheme, it is
> + * assumed that the Oids of matching partitions from
> both the tables
> + * are placed at the same position in the array of
> partition oids in
>
> Rather than saying that we assume this, you should say why it has to
> be true. (If it doesn't have to be true, we shouldn't assume it.)

Will take care of this.

>
> + * join relations. Partition tables should have same
> layout as the
> + * parent table and hence should not need any
> translation. But rest of
>
> The same attributes have to be present with the same types, but they
> can be rearranged. This comment seems to imply the contrary.

Hmm, will take care of this.

>
> FRACTION_PARTS_TO_PLAN seems like it should be a GUC.

+1. Will take care of this. Does "representative_partitions_fraction"
or "sample_partition_fraction" look like a good GUC name? Any other
suggestions?

>
> + /*
> + * Add this relation to the list of samples ordered by
> the increasing
> + * number of rows at appropriate place.
> + */
> + foreach (lc, ordered_child_nos)
> + {
> + int child_no = lfirst_int(lc);
> + RelOptInfo *other_childrel = rel->part_rels[child_no];
> +
> + /*
> + * Keep track of child with lowest number of
> rows but higher than the
> + * that of the child being inserted. Insert
> the child before a
> + * child with highest number of rows lesser than it.
> + */
> + if (child_rel->rows <= other_childrel->rows)
> + insert_after = lc;
> + else
> + break;
> + }
>
> Can we use quicksort instead of a hand-coded insertion sort?

I guess so, if I write comparison functions, which shouldn't be a
problem. Will try that.

>
> + if (bms_num_members(outer_relids) > 1)
>
> Seems like bms_get_singleton_member could be used.
>
> + * Partitioning scheme in join relation indicates a possibilty that the
>
> Spelling.
>
> There seems to be no reason for create_partition_plan to be separated
> from create_plan_recurse. You can just add another case for the new
> path type.
>
> Why does create_partition_join_path need to be separate from
> create_partition_join_path_with_pathkeys? Couldn't that be combined
> into a single function with a pathkeys argument that might sometimes
> be NIL? I assume most of the logic is common.
>
> From a sort of theoretical standpoint, the biggest danger of this
> patch seems to be that by deferring path creation until a later stage
> than normal, we could miss some important processing.
> subquery_planner() does a lot of stuff after
> expand_inherited_tables(); if any of those things, especially the ones
> that happen AFTER path generation, have an effect on the paths, then
> this code needs to compensate for those changes somehow. It seems
> like having the planning of unsampled children get deferred until
> create_plan() time is awfully surprising; here we are creating the
> plan and suddenly what used to be a straightforward path->plan
> translation is running around doing major planning work. I can't
> entirely justify it, but I somehow have a feeling that work ought to
> be moved earlier. Not sure exactly where.
>
> This is not really a full review, mostly because I can't easily figure
> out the motivation for all of the changes the patch makes. It makes a
> lot of changes in a lot of places, and it's not really very easy to
> understand why those changes are necessary. My comments above about
> splitting the patch into a series of patches that can potentially be
> reviewed and applied independently, with the main patch being the last
> in the series, are a suggestion as to how to tackle that. There might
> be some work that needs to or could be done on the comments, too. For
> example, the patch splits out add_paths_to_append_rel from
> set_append_rel_pathlist, but the comments don't say anything helpful
> like "we need to do X after Y, because Z". They just say that we do
> it. To some extent I think the comments in the optimizer have that
> problem generally, so it's not entirely the fault of this patch;
> still, the lack of those explanations makes the code reorganization
> harder to follow, and might confuse future patch authors, too.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

In response to

Re: Partition-wise join for join between (declaratively) partitioned tables at 2017-02-01 21:11:52 from Robert Haas

Responses

Re: Partition-wise join for join between (declaratively) partitioned tables at 2017-02-06 10:57:18 from Ashutosh Bapat
Re: Partition-wise join for join between (declaratively) partitioned tables at 2017-02-09 10:50:26 from Ashutosh Bapat
Re: Partition-wise join for join between (declaratively) partitioned tables at 2017-02-28 04:50:33 from Robert Haas

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Martín Marqués	2017-02-06 10:32:13	Re: Index corruption with CREATE INDEX CONCURRENTLY
Previous Message	Konstantin Knizhnik	2017-02-06 09:47:09	Re: Active zombies at AIX