Re: Partition-wise join for join between (declaratively) partitioned tables

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Partition-wise join for join between (declaratively) partitioned tables
Date: 2017-10-03 19:27:14
Message-ID: CA+TgmoYfD00O908MCbxPzMSjrmemm5Lo8yW41S8EdFy81p8w7Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Oct 3, 2017 at 8:57 AM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> Regarding nomenclature and my previous griping about wisdom, I was
>> wondering about just calling this a "partition join" like you have in
>> the regression test. So the GUC would be enable_partition_join, you'd
>> have generate_partition_join_paths(), etc. Basically just delete
>> "wise" throughout.
>
> Partition-wise join is standard term used in literature and in
> documentation of other popular DBMSes, so partition_wise makes more
> sense. But I am fine with partition_join as well. Do you want it
> partition_join or partitionjoin like enable_mergejoin/enable_hashjoin
> etc.?

Well, you're making me have second thoughts. It's really just that
partition_wise looks a little awkward to me, and maybe that's not
enough reason to change anything. I suppose if I commit it this way
and somebody really hates it, it can always be changed later. We're
not getting a lot of input from anyone else at the moment.

> Attached the updated patch-set.

I decided to skip over 0001 for today and spend some time looking at
0002-0006. Comments below.

0002:

Looks fine.

0003:

The commit message mentions estimate_num_groups but the patch doesn't touch it.

I am concerned that this patch might introduce some problem fixed by
commit dd4134ea56cb8855aad3988febc45eca28851cd8. The comment in that
patch say, at one place, that "This protects against possible
incorrect matches to child expressions that contain no Vars."
However, if a child expression has no Vars, then I think em->em_relids
will be empty, so the bms_is_equal() test that is there now will fail
but your proposed bms_is_subset() test will pass.

0004:

I suggest renaming get_wholerow_ref_from_convert_row_type to
is_converted_whole_row_reference and making it return a bool.

The coding of that function is a little strange; why not move Var to
an inner scope? Like this: if (IsA(convexpr->arg, var)) { Var *var =
castNode(Var, convexpr->arg; if (var->varattno == 0) return var; }

Will the statement that "In case of multi-level partitioning, we will
have as many nested ConvertRowtypeExpr as there are levels in
partition hierarchy" be falsified by Amit Khandekar's pending patch to
avoid sticking a ConvertRowTypeExpr on top of another
ConvertRowTypeExpr? Even if the answer is "no", I think it might be
better to drop this part of the comment; it would be easy for it to
become false in the future, because we might want to optimize that
case in the future and we'll probably forget to update this comment
when we do.

In fix_upper_expr_mutator(), you have an if statement whose entire
contents are another if statement. I think you should use && instead,
and maybe reverse the order of the tests, since
context->subplan_itlist->has_conv_whole_rows is probably cheaper to
test than a function call. It's also a little strange that this code
isn't adjacent too, or merged with, the existing has_non_vars case.
Maybe:

converted_whole_row = is_converted_whole_row_reference(node);
if (context->outer_itlist && (context->outer_itlist->has_non_vars ||
(context->outer_itlist->has_conv_whole_rows && converted_whole_row))
...
if (context->inner_itlist && (context->inner_itlist->has_non_vars ||
(context->inner_itlist->has_conv_whole_rows && converted_whole_row))
...

0005:

The comment explaining why the ParamPathInfo is allocated in the same
context as the RelOptInfo is a modified copy of an existing comment
that still reads like the original, a manner of commenting I find a
bit undesirable as it leads to filling up the source base with
duplicate comments.

I don't think I believe that comment, either. In the case from which
that comment was copied (mark_dummy_rel), it was talking about a
RelOptInfo, and geqo_eval() takes care to remove any leftover pointers
to joinrels creating during a GEQO cycle. But there's no similar
logic for ppilist, so I think what will happen here is that you'll end
up with a freed node in the middle of the list.

I think reparameterize_path_by_chid() could use a helper function
reparameterize_pathlist_by_child() that iterates over a list of paths
and returns a list of paths. That would remove some of the loops.

I think the comments for reparameterize_path_by_child() need to be
expanded. They don't explain how you decided which nodes need to be
handled here or which fields within those nodes need some kind of
handling other than a flat-copy. I think these kinds of explanations
will be important for future maintenance of this code. You know why
you did it this way, I can mostly guess what you did it this way, but
what about the next person who comes along who hasn't made a detailed
study of partition-wise join?

I don't see much point in the T_SubqueryScanPath and T_ResultPath
cases in reparameterize_path_by_child(). It's just falling through to
the default case.

I wonder if reparameterize_path_by_child() ought to default to
returning NULL rather than throwing an error; the caller would then
have to be prepared for that and skip building the path. But that
would be more like what reparameterize_path() does, and it would make
failure to include some relevant path type here a corner-case
performance bug rather than a correctness issue. It seems like
someone adding a new path type could quite easily fail to realize that
it might need to be added here, or might be unsure whether it's
necessary to add it here.

0006:

I have some doubts about how stable all of the EXPLAIN outputs are
going to be on the buildfarm. I'm not sure what we can really do
about that in advance of trying them, but it's a lot of EXPLAIN
output. If you have an ideas about how to tighten it up without
losing test coverage, that would be good. For example, maybe the
"full outer join" case isn't needed given the following test case
which is also a full outer join but which covers additional behavior.

I think it would be good to have a test case that shows multi-level
partition-wise join working across multiple levels. I wrote the
attached test, which you're welcome to use if you like it, adapt if
you sorta like it, or replace if you dislike it. The table names at
least should be changed to something less likely to duplicate other
tests.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
mlpartjoin.sql application/octet-stream 2.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2017-10-03 19:38:00 Re: UPDATE of partition key
Previous Message Andres Freund 2017-10-03 19:00:33 datetime.h defines like PM conflict with external libraries