Re: Partition-wise join for join between (declaratively) partitioned tables

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc: Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Partition-wise join for join between (declaratively) partitioned tables
Date: 2017-02-01 21:11:52
Message-ID: CA+TgmoZDi9fy6eZzAcMGeECCUXVxKwVFaT-dmsz9kZF+HL2p6g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jan 2, 2017 at 7:32 AM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> PFA the patch (pg_dp_join_v6.patch) with some bugs fixed and rebased
> on the latest code.

Maybe not surprisingly given how fast things are moving around here
these days, this needs a rebase.

Apart from that, my overall comment on this patch is that it's huge:

37 files changed, 7993 insertions(+), 287 deletions(-)

Now, more than half of that is regression test cases and their output,
which you will certainly be asked to pare down in any version of this
intended for commit. But even excluding those, it's still a fairly
patch:

30 files changed, 2783 insertions(+), 272 deletions(-)

I think the reason this is so large is because there's a fair amount
of refactoring work that has been done as a precondition of the actual
meat of the patch, and no attempt has been made to separate the
refactoring work from the main body of the patch. I think that's
something that needs to be done. If you look at the way Amit Langote
submitted the partitioning patches and the follow-up bug fixes, he had
a series of patches 0001-blah, 0002-quux, etc. generated using
format-patch. Each patch had its own commit message written by him
explaining the purpose of that patch, links to relevant discussion,
etc. If you can separate this into more digestible chunks it will be
easier to get committed.

Other questions/comments:

Why does find_partition_scheme need to copy the partition bound
information instead of just pointing to it? Amit went to some trouble
to make sure that this can't change under us while we hold a lock on
the relation, and we'd better hold a lock on the relation if we're
planning a query against it.

I think the PartitionScheme stuff should live in the optimizer rather
that src/backend/catalog/partition.c. Maybe plancat.c? Perhaps we
eventually need a new file in the optimizer just for partitioning
stuff, but I'm not sure about that yet.

The fact that set_append_rel_size needs to reopen the relation to
extract a few more bits of information is not desirable. You need to
fish this information through in some other way; for example, you
could have get_relation_info() stash the needed bits in the
RelOptInfo.

+ * For two partitioned tables with the same
partitioning scheme, it is
+ * assumed that the Oids of matching partitions from
both the tables
+ * are placed at the same position in the array of
partition oids in

Rather than saying that we assume this, you should say why it has to
be true. (If it doesn't have to be true, we shouldn't assume it.)

+ * join relations. Partition tables should have same
layout as the
+ * parent table and hence should not need any
translation. But rest of

The same attributes have to be present with the same types, but they
can be rearranged. This comment seems to imply the contrary.

FRACTION_PARTS_TO_PLAN seems like it should be a GUC.

+ /*
+ * Add this relation to the list of samples ordered by
the increasing
+ * number of rows at appropriate place.
+ */
+ foreach (lc, ordered_child_nos)
+ {
+ int child_no = lfirst_int(lc);
+ RelOptInfo *other_childrel = rel->part_rels[child_no];
+
+ /*
+ * Keep track of child with lowest number of
rows but higher than the
+ * that of the child being inserted. Insert
the child before a
+ * child with highest number of rows lesser than it.
+ */
+ if (child_rel->rows <= other_childrel->rows)
+ insert_after = lc;
+ else
+ break;
+ }

Can we use quicksort instead of a hand-coded insertion sort?

+ if (bms_num_members(outer_relids) > 1)

Seems like bms_get_singleton_member could be used.

+ * Partitioning scheme in join relation indicates a possibilty that the

Spelling.

There seems to be no reason for create_partition_plan to be separated
from create_plan_recurse. You can just add another case for the new
path type.

Why does create_partition_join_path need to be separate from
create_partition_join_path_with_pathkeys? Couldn't that be combined
into a single function with a pathkeys argument that might sometimes
be NIL? I assume most of the logic is common.

From a sort of theoretical standpoint, the biggest danger of this
patch seems to be that by deferring path creation until a later stage
than normal, we could miss some important processing.
subquery_planner() does a lot of stuff after
expand_inherited_tables(); if any of those things, especially the ones
that happen AFTER path generation, have an effect on the paths, then
this code needs to compensate for those changes somehow. It seems
like having the planning of unsampled children get deferred until
create_plan() time is awfully surprising; here we are creating the
plan and suddenly what used to be a straightforward path->plan
translation is running around doing major planning work. I can't
entirely justify it, but I somehow have a feeling that work ought to
be moved earlier. Not sure exactly where.

This is not really a full review, mostly because I can't easily figure
out the motivation for all of the changes the patch makes. It makes a
lot of changes in a lot of places, and it's not really very easy to
understand why those changes are necessary. My comments above about
splitting the patch into a series of patches that can potentially be
reviewed and applied independently, with the main patch being the last
in the series, are a suggestion as to how to tackle that. There might
be some work that needs to or could be done on the comments, too. For
example, the patch splits out add_paths_to_append_rel from
set_append_rel_pathlist, but the comments don't say anything helpful
like "we need to do X after Y, because Z". They just say that we do
it. To some extent I think the comments in the optimizer have that
problem generally, so it's not entirely the fault of this patch;
still, the lack of those explanations makes the code reorganization
harder to follow, and might confuse future patch authors, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2017-02-01 21:13:34 Re: Vacuum: allow usage of more than 1GB of work mem
Previous Message Claudio Freire 2017-02-01 21:02:07 Re: Vacuum: allow usage of more than 1GB of work mem