Re: Partition-wise join for join between (declaratively) partitioned tables

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Partition-wise join for join between (declaratively) partitioned tables
Date: 2017-10-04 12:23:22
Message-ID: CAFjFpRewe+dp46sOx50XU3HSiQ569DAsuiMZKtZLXL2GVvjicg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Oct 4, 2017 at 12:57 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> 0003:
>
> The commit message mentions estimate_num_groups but the patch doesn't touch it.

This was fixed when we converted many rel->reloptkind ==
RELOPT_BASEREL to IS_SIMPLE_REL(). I have removed this section from
the commit message.

>
> I am concerned that this patch might introduce some problem fixed by
> commit dd4134ea56cb8855aad3988febc45eca28851cd8. The comment in that
> patch say, at one place, that "This protects against possible
> incorrect matches to child expressions that contain no Vars."
> However, if a child expression has no Vars, then I think em->em_relids
> will be empty, so the bms_is_equal() test that is there now will fail
> but your proposed bms_is_subset() test will pass.

bms_is_equal() was enough when there was only a single member in
relids but it doesn't work now that there can be multiple of them.
bms_is_equal() was replaced with bms_is_subset() to accomodate for
ec_members with only a subset of relids when we are searching for a
join relation.

I am not sure whether your assumption that expression with no Vars
would have em_relids empty is correct. I wonder whether we will add
any em_is_child members with empty em_relids; looking at
process_equivalence() those come from RestrictInfo::left/right_relids
which just indicates the relids at which that particular expression
can be evaluated. Place holder vars is an example when that can
happen, but there may be others. To verify this, I tried attached
patch on master and ran make check. The assertion didn't trip. If
em_relids is not NULL, bms_is_subset() is fine.

If em_relids could indeed go NULL when em_is_child is true, passing
NULL relids (for parent rels) to that function can cause unwanted
behaviour. bms_is_equal(em->em_relids, relids) will return true
turning the if (em->em_is_child && !bms_is_equal(em->em_relids,
relids)) to false. This means that we will consider a child member
with em_relids NULL even while matching a parent relation. What
surprises me is, that commit added a bunch of testcases and none of
them failed with this change.

Nonetheless, I have changed "matches" with "belongs to" in the
prologue of those functions since an exact match won't be possible
with child-joins.

>
> 0004:
>
> I suggest renaming get_wholerow_ref_from_convert_row_type to
> is_converted_whole_row_reference and making it return a bool.

Done.

>
> The coding of that function is a little strange; why not move Var to
> an inner scope? Like this: if (IsA(convexpr->arg, var)) { Var *var =
> castNode(Var, convexpr->arg; if (var->varattno == 0) return var; }

I probably went too far to avoid indented code :). Fixed now.

>
> Will the statement that "In case of multi-level partitioning, we will
> have as many nested ConvertRowtypeExpr as there are levels in
> partition hierarchy" be falsified by Amit Khandekar's pending patch to
> avoid sticking a ConvertRowTypeExpr on top of another
> ConvertRowTypeExpr? Even if the answer is "no", I think it might be
> better to drop this part of the comment; it would be easy for it to
> become false in the future, because we might want to optimize that
> case in the future and we'll probably forget to update this comment
> when we do.

That might keep someone wondering where the nested
ConvertRowtypeExpr's came from. But may be in future those can arise
from something other than multi-level partition hierarchy and in that
case too the comment would be rendered inaccurate. So done.

>
> In fix_upper_expr_mutator(), you have an if statement whose entire
> contents are another if statement. I think you should use && instead,
> and maybe reverse the order of the tests, since
> context->subplan_itlist->has_conv_whole_rows is probably cheaper to
> test than a function call. It's also a little strange that this code
> isn't adjacent too, or merged with, the existing has_non_vars case.
> Maybe:
>
> converted_whole_row = is_converted_whole_row_reference(node);
> if (context->outer_itlist && (context->outer_itlist->has_non_vars ||
> (context->outer_itlist->has_conv_whole_rows && converted_whole_row))
> ...
> if (context->inner_itlist && (context->inner_itlist->has_non_vars ||
> (context->inner_itlist->has_conv_whole_rows && converted_whole_row))

I placed it with the other node types since it's for a specific node
type, but I guess your suggestion avoids duplicates and looks better.
Done.

> ...
>
> 0005:
>
> The comment explaining why the ParamPathInfo is allocated in the same
> context as the RelOptInfo is a modified copy of an existing comment
> that still reads like the original, a manner of commenting I find a
> bit undesirable as it leads to filling up the source base with
> duplicate comments.

I have pointed to mark_dummy_rel() in that comment instead of
duplicating the whole paragraph.

>
> I don't think I believe that comment, either. In the case from which
> that comment was copied (mark_dummy_rel), it was talking about a
> RelOptInfo, and geqo_eval() takes care to remove any leftover pointers
> to joinrels creating during a GEQO cycle. But there's no similar
> logic for ppilist, so I think what will happen here is that you'll end
> up with a freed node in the middle of the list.

In mark_dummy_rel() it's not about RelOptInfo, it's about the pathlist
with dummy path being created in the same context as the RelOptInfo.
Same applies here. While reparameterizing a path tree, we may reach a
path for a base relation and create a PPI for a base relation. This
may happen when GEQO is planning a join, and thus we are in a
short-lived context created by that GEQO cycle. We don't want a base
rel PPI to be created in that context, so instead we use the context
of base rel itself. Other way round, we don't want to use a longer
context for creating PPI for a join relation when it's created by a
GEQO cycle. So, we use join relation's context.The code doesn't free
up a node in the middle of the list but it avoids such an anomaly. See
[1]
>
> I think reparameterize_path_by_chid() could use a helper function
> reparameterize_pathlist_by_child() that iterates over a list of paths
> and returns a list of paths. That would remove some of the loops.

That's a good idea. Done.

>
> I think the comments for reparameterize_path_by_child() need to be
> expanded. They don't explain how you decided which nodes need to be
> handled here or which fields within those nodes need some kind of
> handling other than a flat-copy. I think these kinds of explanations
> will be important for future maintenance of this code. You know why
> you did it this way, I can mostly guess what you did it this way, but
> what about the next person who comes along who hasn't made a detailed
> study of partition-wise join?

We need to reparameterize any path which contains further paths and/or
contains expressions that point to the parent relation. For a given
path we need to reparameterize any paths that it contains and
translate any expressions that are specific to that path. Expressions
common across the paths are translated after the switch case. I have
added this rule to the comment just above the switch case
/*
* Copy of the given path. Reparameterize any paths referenced by the given
* path. Replace parent Vars in path specific expressions by corresponding
* child Vars.
*/
Does that look fine or we want to add explanation for every node handled here.

>
> I don't see much point in the T_SubqueryScanPath and T_ResultPath
> cases in reparameterize_path_by_child(). It's just falling through to
> the default case.

I added those cases separately to explain why we should not see those
cases in that switch case. I think that explanation is important
(esp. considering your comment above) and associating those comment
with "case" statement looks better. Are you suggesting that we should
add that explanation in default case?

>
> I wonder if reparameterize_path_by_child() ought to default to
> returning NULL rather than throwing an error; the caller would then
> have to be prepared for that and skip building the path. But that
> would be more like what reparameterize_path() does, and it would make
> failure to include some relevant path type here a corner-case
> performance bug rather than a correctness issue. It seems like
> someone adding a new path type could quite easily fail to realize that
> it might need to be added here, or might be unsure whether it's
> necessary to add it here.

I am OK with that. However reparameterize_path_by_child() and
reparameterize_paths_by_child() are callers of
reparameterize_path_by_child() so they will need to deal with NULL
return. I am fine with that too, but making sure that we are on the
same page. If we do that, we could simply assert that the switch case
doesn't see T_SubqueryScanPath and T_ResultPath.

>
> 0006:
>
> I have some doubts about how stable all of the EXPLAIN outputs are
> going to be on the buildfarm. I'm not sure what we can really do
> about that in advance of trying them, but it's a lot of EXPLAIN
> output. If you have an ideas about how to tighten it up without
> losing test coverage, that would be good. For example, maybe the
> "full outer join" case isn't needed given the following test case
> which is also a full outer join but which covers additional behavior.

Yes, I too am thinking about the same. The only reason I have EXPLAIN
output there is to check whether partition-wise join is being used or
not. The testcase is not interested in the actual shape. It doesn't
make sense to just test the output if partition-wise join is not used.
May be a function examining the plan tree would help. The function
will have to handle Result/Sort nodes on top and make sure that Append
has join children. Do you have any other idea to check the shape of
the plan tree without the details? Any EXPLAIN switch, existing
functions etc.?

Removed the extra full outer join testcase.
>
> I think it would be good to have a test case that shows multi-level
> partition-wise join working across multiple levels. I wrote the
> attached test, which you're welcome to use if you like it, adapt if
> you sorta like it, or replace if you dislike it. The table names at
> least should be changed to something less likely to duplicate other
> tests.
>

There are tests for multi-level partitioned table in the file. They
test whole partition hierarchy join, part of it being joined based on
the quals. Search for
--
-- multi-leveled partitions
--

Have you looked at those? They test two-level partitioned tables and
your test tests three-level partitioned table. I can modify the tests
to have three levels of partitions and different partition schemes on
different levels. Is that what you expect?

[1] https://www.postgresql.org/message-id/CAFjFpRcPutbr4nVAsrY-5q%3DwCFrNK25_3MNhHgyYYM0yeOoj%3DQ%40mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment Content-Type Size
em_is_child_em_relids.patch text/x-patch 740 bytes

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Bapat 2017-10-04 12:28:16 Re: Another oddity in handling of WCO constraints in postgres_fdw
Previous Message Andreas Joseph Krogh 2017-10-04 12:20:11 Re: [PATCH] WIP Add ALWAYS DEFERRED option for constraints