Re: Partition-wise join for join between (declaratively) partitioned tables

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Partition-wise join for join between (declaratively) partitioned tables
Date: 2017-10-04 18:54:30
Message-ID: CA+TgmoYd0LZE+GcS1R+zTZowUG7cr7KWZutjfcPA2iHxEROkPw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Oct 4, 2017 at 8:23 AM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> I am not sure whether your assumption that expression with no Vars
> would have em_relids empty is correct. I wonder whether we will add
> any em_is_child members with empty em_relids; looking at
> process_equivalence() those come from RestrictInfo::left/right_relids
> which just indicates the relids at which that particular expression
> can be evaluated. Place holder vars is an example when that can
> happen, but there may be others. To verify this, I tried attached
> patch on master and ran make check. The assertion didn't trip. If
> em_relids is not NULL, bms_is_subset() is fine.

I spent some more time experimenting with this. I found that cases
where an em_is_child equivalence class contains multiple relids are
quite easy to generate, e.g. select * from foo, bar where foo.a +
bar.a = 0, where foo and bar are partitioned. However, I wasn't able
to generate a case where an em_is_child equivalence class has no
relids at all, and I'm out of ideas about how such a thing could
occur. I suspect it can't. I wondered whether there was some problem
with the multiple-relids case, but I can't find an example where that
misbehaves either. So maybe it's fine (or maybe I'm just not smart
enough to find the case where it breaks).

>> I don't think I believe that comment, either. In the case from which
>> that comment was copied (mark_dummy_rel), it was talking about a
>> RelOptInfo, and geqo_eval() takes care to remove any leftover pointers
>> to joinrels creating during a GEQO cycle. But there's no similar
>> logic for ppilist, so I think what will happen here is that you'll end
>> up with a freed node in the middle of the list.
>
> In mark_dummy_rel() it's not about RelOptInfo, it's about the pathlist
> with dummy path being created in the same context as the RelOptInfo.
> Same applies here.

Oops. I was thinking that the ppilist was attached to some
planner-global structure, but it's not; it's hanging off the
RelOptInfo. So you're entirely right, and I'm just being dumb.

> We need to reparameterize any path which contains further paths and/or
> contains expressions that point to the parent relation. For a given
> path we need to reparameterize any paths that it contains and
> translate any expressions that are specific to that path. Expressions
> common across the paths are translated after the switch case. I have
> added this rule to the comment just above the switch case
> /*
> * Copy of the given path. Reparameterize any paths referenced by the given
> * path. Replace parent Vars in path specific expressions by corresponding
> * child Vars.
> */
> Does that look fine or we want to add explanation for every node handled here.

No, I don't think we want something for every node, just a general
explanation at the top of the function. Maybe something like this:

Most fields from the original path can simply be flat-copied, but any
expressions must be adjusted to refer to the correct varnos, and any
paths must be recursively reparameterized. Other fields that refer to
specific relids also need adjustment.

>> I don't see much point in the T_SubqueryScanPath and T_ResultPath
>> cases in reparameterize_path_by_child(). It's just falling through to
>> the default case.
>
> I added those cases separately to explain why we should not see those
> cases in that switch case. I think that explanation is important
> (esp. considering your comment above) and associating those comment
> with "case" statement looks better. Are you suggesting that we should
> add that explanation in default case?

Or leave the explanation out altogether.

>> I wonder if reparameterize_path_by_child() ought to default to
>> returning NULL rather than throwing an error; the caller would then
>> have to be prepared for that and skip building the path. But that
>> would be more like what reparameterize_path() does, and it would make
>> failure to include some relevant path type here a corner-case
>> performance bug rather than a correctness issue. It seems like
>> someone adding a new path type could quite easily fail to realize that
>> it might need to be added here, or might be unsure whether it's
>> necessary to add it here.
>
> I am OK with that. However reparameterize_path_by_child() and
> reparameterize_paths_by_child() are callers of
> reparameterize_path_by_child() so they will need to deal with NULL
> return. I am fine with that too, but making sure that we are on the
> same page. If we do that, we could simply assert that the switch case
> doesn't see T_SubqueryScanPath and T_ResultPath.

Or do nothing at all about those cases.

I noticed today that the version of the patchset I have here says in
the header comments for reparameterize_path_by_child() that it returns
NULL if it can't reparameterize, but that's not what it actually does.
If you make this change, the existing comment will become correct.

The problem with the NULL return convention is that it's not very
convenient when this function is recursing. Maybe we should change
this function's signature to be bool
reparameterize_path_by_child(PlannerInfo *root, RelOptInfo *child_rel,
Path **path); then you could do, e.g. if
(!reparameterize_path_by_child(root, child_rel, &bhpath->bitmapqual))
return;

But I don't really like that approach; it's still quite long-winded.
Instead, I suggest Stupid Macro Tricks:

#define ADJUST_CHILD_ATTRS(val) \
val = (List *) adjust_appendrel_attrs_multilevel((Node *) val,
child_rel->relids, child_rel->top_parent_relids);

#define REPARAMETERIZE_CHILD_PATH(val) \
val = reparameterize_path_by_child(root, val, child_rel); \
if (val == NULL) \
return NULL;

#define REPARAMETERIZE_CHILD_PATH_LIST(val) \
if (val != NIL) \
{ \
val = reparameterize_pathlist_by_child(root, val, child_rel); \
if (val == NIL) \
return NULL; \
}

With that, a complicated case like T_NestPath becomes just:

JoinPath *jpath;

FLAT_COPY_PATH(jpath, path, NestPath);
REPARAMETERIZE_CHILD_PATH(jpath->outerjoinpath);
REPARAMETERIZE_CHILD_PATH(jpath->innerjoinpath);
ADJUST_CHILD_ATTRS(jpath->joinrestrictinfo);
new_path = (Path *) jpath;

Now, I admit that hiding stuff inside the macro definitions like that
is ugly. But I think it's still better than repeating boilerplate
code with finnicky internal bits lots of times.

> Yes, I too am thinking about the same. The only reason I have EXPLAIN
> output there is to check whether partition-wise join is being used or
> not. The testcase is not interested in the actual shape. It doesn't
> make sense to just test the output if partition-wise join is not used.
> May be a function examining the plan tree would help. The function
> will have to handle Result/Sort nodes on top and make sure that Append
> has join children. Do you have any other idea to check the shape of
> the plan tree without the details? Any EXPLAIN switch, existing
> functions etc.?

No, not really. We may just need to be prepared to fix whatever breaks.

>> I think it would be good to have a test case that shows multi-level
>> partition-wise join working across multiple levels. I wrote the
>> attached test, which you're welcome to use if you like it, adapt if
>> you sorta like it, or replace if you dislike it. The table names at
>> least should be changed to something less likely to duplicate other
>> tests.
>>
>
> There are tests for multi-level partitioned table in the file. They
> test whole partition hierarchy join, part of it being joined based on
> the quals. Search for
> --
> -- multi-leveled partitions
> --
>
> Have you looked at those? They test two-level partitioned tables and
> your test tests three-level partitioned table. I can modify the tests
> to have three levels of partitions and different partition schemes on
> different levels. Is that what you expect?

Oops, no, I just missed the test case. I saw the one that said "inner
join, qual covering only top-level partitions" and missed that there
were others later where the quals covered lower levels also.

Instead of "multi-leveled partitions" it might read better to say
"multiple levels of partitioning".

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nico Williams 2017-10-04 19:10:37 Re: Possible SSL improvements for a newcomer to tackle
Previous Message Peter Geoghegan 2017-10-04 18:50:03 Re: [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple