Re: Partition-wise join for join between (declaratively) partitioned tables

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Partition-wise join for join between (declaratively) partitioned tables
Date: 2017-07-31 13:07:22
Message-ID: CAFjFpRexx3dmVzrsocMZFL1jiEcW0HyeNb35Fv2JP4wnO50nmg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Forgot the patch set. Here it is.

On Mon, Jul 31, 2017 at 5:29 PM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> On Mon, Jul 31, 2017 at 8:32 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Fri, Jul 14, 2017 at 3:02 AM, Ashutosh Bapat
>> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>>> Here's revised patch set with only 0004 revised. That patch deals with
>>> creating multi-level inheritance hierarchy from multi-level partition
>>> hierarchy. The original logic of recursively calling
>>> inheritance_planner()'s guts over the inheritance hierarchy required
>>> that for every such recursion we flatten many lists created by that
>>> code. Recursion also meant that root->append_rel_list is traversed as
>>> many times as the number of partitioned partitions in the hierarchy.
>>> Instead the revised version keep the iterative shape of
>>> inheritance_planner() intact, thus naturally creating flat lists,
>>> iterates over root->append_rel_list only once and is still easy to
>>> read and maintain.
>>
>> 0001-0003 look basically OK to me, modulo some cosmetic stuff. Regarding 0004:
>>
>> + if (brel->reloptkind != RELOPT_BASEREL &&
>> + brte->relkind != RELKIND_PARTITIONED_TABLE)
>>
>> I spent a lot of time staring at this code before I figured out what
>> was going on here. We're iterating over simple_rel_array, so the
>> reloptkind must be RELOPT_OTHER_MEMBER_REL if it isn't RELOPT_BASEREL.
>> But does that guarantee that rtekind is RTE_RELATION such that
>> brte->relkind will be initialized to a value? I'm not 100% sure.
>
> Comment in RangeTblEntry says
> 952 /*
> 953 * Fields valid for a plain relation RTE (else zero):
> 954 *
> ... clipped portion for RTE_NAMEDTUPLESTORE related comment
>
> 960 Oid relid; /* OID of the relation */
> 961 char relkind; /* relation kind (see pg_class.relkind) */
>
> This means that relkind will be 0 when rtekind != RTE_RELATION. So,
> the condition holds. But code creating an RTE somewhere which is not
> in sync with this comment would create a problem. So your suggestion
> makes sense.
>
>> I
>> think it would be clearer to write this test like this:
>>
>> Assert(IS_SIMPLE_REL(brel));
>> if (brel->reloptkind == RELOPT_OTHER_MEMBER_REL &&
>> (brte->rtekind != RELOPT_BASEREL ||
>
> Do you mean (brte_>rtekind != RTE_RELATION)?
>
>> brte->relkind != RELKIND_PARTITIONED_TABLE))
>> continue;
>>
>> Note that the way you wrote the comment is says if it *is* another
>> REL, not if it's *not* a baserel; it's good if those kinds of little
>> details match between the code and the comments.
>
> I find the existing comment and code in this part of the function
> differ. The comment just above the loop on simple_rel_array[], talks
> about changing something in the child, but the very next line skips
> child relations and later a loop on append_rel_list makes changes to
> appropriate children. I guess, it's done that way to keep the code
> working even after we introduce some RELOPTKIND other than BASEREL or
> OTHER_MEMBER_REL for a simple rel. But your suggestion makes more
> sense. Changed it according to your suggestion.
>
>>
>> It is not clear to me what the motivation is for the API changes in
>> expanded_inherited_rtentry. They don't appear to be necessary.
>
> expand_inherited_rtentry() creates AppendRelInfos for all the children
> of a given parent and collects them in a list. The list is appended to
> root->append_rel_list at the end of the function. Now that function
> needs to do this recursively. This means that for a partitioned
> partition table its children's AppendRelInfos will be added to
> root->append_rel_list before AppendRelInfo of that partitioned
> partition table. inheritance_planner() assumes that the parent's
> AppendRelInfo comes before its children in append_rel_list.This
> assumption allows it to be simplified greately, retaining its
> iterative form. My earlier patches had recursive version of
> inheritance_planner(), which is complex. I have comments in this patch
> explaining this.
>
> Adding AppendRelInfos to root->append_rel_list as and when they are
> created would keep parent AppendRelInfos before those of children. But
> that function throws away the AppendRelInfo it created when their are
> no real children i.e. in partitioned table's case when has no leaf
> partitions. So, we can't do that. Hence, I chose to change the API to
> return the list of AppendRelInfos when the given RTE has real
> children.
>
>> If
>> they are necessary, you need to do a more thorough job updating the
>> comments. This one, in particular:
>>
>> * If so, add entries for all the child tables to the query's
>> * rangetable, and build AppendRelInfo nodes for all the child tables
>> * and add them to root->append_rel_list. If not, clear the entry's
>
> Done.
>
>>
>> And the comments could maybe say something like "We return the list of
>> appinfos rather than directly appending it to append_rel_list because
>> $REASON."
>
> Done. Please check the attached version.
>
>>
>> - * is a partitioned table.
>> + * RTE simply duplicates the parent *partitioned* table.
>> */
>> - if (childrte->relkind != RELKIND_PARTITIONED_TABLE)
>> + if (childrte->relkind != RELKIND_PARTITIONED_TABLE || childrte->inh)
>>
>> This is another case where it's hard to understand the test from the comments.
>
> The current comment says it all, but it very cryptic manner.
> 1526 /*
> 1527 * Build an AppendRelInfo for this parent and child,
> unless the child
> 1528 * RTE simply duplicates the parent *partitioned* table.
> 1529 */
>
> The comment makes sense in the context of this paragraph in the prologue
> 1364 * Note that the original RTE is considered to represent the whole
> 1365 * inheritance set. The first of the generated RTEs is an RTE for the same
> 1366 * table, but with inh = false, to represent the parent table in its role
> 1367 * as a simple member of the inheritance set.
> 1368 *
>
> The code avoids creating AppendRelInfos for a child which represents
> the parent in its role as a simple member of inheritance set.
>
> I have reworded it as
> 1526 /*
> 1527 * Build an AppendRelInfo for this parent and child,
> unless the child
> 1528 * RTE represents the parent as a simple member of inheritance set.
> 1529 */
>
>>
>> + * In case of multi-level inheritance hierarchy, for every child we require
>> + * PlannerInfo of its immediate parent. Hence we save those in a an array
>>
>> Say why. Also, need to fix "a an".
>
> Done.
>
>>
>> I'm a little bit surprised that this patch doesn't make any changes to
>> allpaths.c or relnode.c.
>
>> It looks to me like we'll generate paths for
>> the new RTEs that are being added. Are we going to end up with
>> multiple levels of Append nodes, then? Does the consider the way
>> consider_parallel is propagated up and down in set_append_rel_size()
>> and set_append_rel_pathlist() really work with multiple levels? Maybe
>> this is all fine; I haven't tried to verify it in depth.
>
> This has been discussed before, but I can not locate the mail
> answering these questions. accumulate_append_subpath() called from
> add_paths_to_append_rel() takes care of flattening Merge/Append paths.
> The planner code deals with the multi-level inheritance hierarchy
> created for subqueries with set operations. There is code in relnode.c
> to build the RelOptInfos for such subqueries recursively through using
> RangeTblEntry::inh flag. So there are no changes in allpaths.c and
> relnode.c. Are you looking for some other changes?
>
>>
>> Overall I think this is a reasonable direction to go but I'm worried
>> that there may be bugs lurking -- other code that needs adjusting that
>> hasn't been found, really.
>>
>
> Planner code is already aware of such hierarchies except DMLs, which
> this patch adjusts. We have fixed issues revealed by mine and
> Rajkumar's testing.
> What kinds of things you suspect?
>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment Content-Type Size
pg_dp_join_patches_v23.tar.gz application/x-gzip 152.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2017-07-31 13:39:16 Re: pl/perl extension fails on Windows
Previous Message Merlin Moncure 2017-07-31 13:00:43 Re: 10 beta docs: different replication solutions