Re: Partition-wise join for join between (declaratively) partitioned tables

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Partition-wise join for join between (declaratively) partitioned tables
Date: 2017-09-11 06:46:59
Message-ID: 05a4e009-3ea6-b294-4668-be745e2ca836@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

On 2017/09/09 2:38, Ashutosh Bapat wrote:
> On Fri, Sep 8, 2017 at 11:17 AM, Amit Langote wrote:
>> I updated the patch to include just those changes. I'm not sure about
>> one of the Ashutosh's changes whereby the child PlanRowMark is also passed
>> to expand_partitioned_rtentry() to use as the parent PlanRowMark. I think
>> the child RTE, child RT index and child Relation are fine, because they
>> are necessary for creating AppendRelInfos in a desired way for later
>> planning steps. But PlanRowMarks are not processed within the planner
>> afterwards and do not need to be marked with the immediate parent-child
>> association in the same way that AppendRelInfos need to be.
>
> Passing top parent's row mark works today, since there is no
> parent-child specific translation happening there. But if in future,
> we introduce such a translation, row marks for indirect children in a
> multi-level partitioned hierarchy won't be accurate. So, I think it's
> better to pass row marks of the direct parent.

IMHO, we should make it the responsibility of the future patch to set a
child PlanRowMark's prti to the direct parent's RT index, when we actually
know that it's needed for something. We clearly know today why we need to
pass the other objects like child RT entry, RT index, and Relation, so we
should limit this patch to pass only those objects to the recursive call.
That makes this patch a relatively easy to understand change.

>> I also included the changes to add_paths_to_append_rel() from my patch on
>> the "path toward faster partition pruning" thread. We'd need that change,
>> because while add_paths_to_append_rel() is called for all partitioned
>> table RTEs in a given partition tree, expand_inherited_rtentry() would
>> have set up a PartitionedChildRelInfo only for the root parent, so
>> get_partitioned_child_rels() would not find the same for non-root
>> partitioned table rels and crash failing the Assert. The changes I made
>> are such that we call get_partitioned_child_rels() only for the parent
>> rels that are known to correspond root partitioned tables (or as you
>> pointed out on the thread, "the table named in the query" as opposed those
>> added to the query as result of inheritance expansion). In addition to
>> the relkind check on the input RTE, it would seem that checking that the
>> reloptkind is RELOPT_BASEREL would be enough. But actually not, because
>> if a partitioned table is accessed in a UNION ALL query, reloptkind even
>> for the root partitioned table (the table named in the query) would be
>> RELOPT_OTHER_MEMBER_REL. The only way to confirm that the input rel is
>> actually the root partitioned table is to check whether its parent rel is
>> not RTE_RELATION, because the parent in case of UNION ALL Append is a
>> RTE_SUBQUERY RT entry.
>>
>
> There was a change in my 0003 patch, which fixed the crash. It checked
> for RELOPT_BASEREL and RELKIND_PARTITIONED_TABLE. I have pulled it in
> my 0001 patch. It no more crashes. I tried various queries involving
> set operations and bare multi-level partitioned table scan with my
> patch, but none of them showed any anomaly. Do you have a testcase
> which shows problem with my patch? May be your suggestion is correct,
> but corresponding code implementation is slightly longer than I would
> expect. So, we should go with it, if there is corresponding testcase
> which shows why it's needed.

If we go with your patch, partitioned tables won't get locked, for
example, in case of the following query (p is a partitioned table):

select 1 from p union all select 2 from p;

That's because the RelOptInfos for the two instances of p in the above
query are RELOPT_OTHER_MEMBER_REL, not RELOPT_BASEREL. They are children
of the Append corresponding to the UNION ALL subquery RTE. So,
partitioned_rels does not get set per your proposed code.

>
> In your patch
>
> + parent_rel = root->simple_rel_array[parent_relid];
> + get_pcinfo = (parent_rel->rtekind == RTE_SUBQUERY);
>
> Do you mean RTE_RELATION as you explained above?

No, I mean RTE_SUBQUERY.

If the partitioned table RTE in question corresponds to one named in the
query, we should be able to find its pcinfo in root->pcinfo_list. If the
partitioned table RTE is one added as result of inheritance expansion, it
won't have an associated pcinfo. So, we should find a way to distinguish
them from one another. The first idea that had occurred to me was the
same as yours -- RelOptInfo of the partitioned table RTE named in the
query would be of reloptkind RELOPT_BASEREL and those of the partitioned
table RTE added as result of inheritance expansion will be of reloptkind
RELOPT_OTHER_MEMBER_REL. Although the latter is always true, the former
is not. If the partitioned table named in the query appears under UNION
ALL query, then its reloptkind will be RELOPT_OTHER_MEMBER_REL. That
means we have to use some other means to distinguish partitioned table
RTEs that have an associated pcinfo from those that don't. So, I devised
this method of looking at the parent RTE (if any) for distinguishing the
two. Partitioned table named in the query either doesn't have the parent
or if it does, the parent could only ever be a UNION ALL subquery
(RTE_SUBQUERY). Partitioned tables added as part of inheritance expansion
will always have the parent and the parent will only ever be a table
(RTE_RELATION).

>>> One thing I notice is that if I rip out the changes to initsplan.c,
>>> the new regression test still passes. If it's possible to write a
>>> test that fails without those changes, I think it would be a good idea
>>> to include one in the patch. That's certainly one of the subtler
>>> parts of this patch, IMHO.
>>
>> Back when this (step-wise expansion of partition inheritance) used to be a
>> patch in the original declarative partitioning patch series, Ashutosh had
>> reported a test query [1] that would fail getting a plan, for which we
>> came up with the initsplan.c changes in this patch as the solution:
>>
>> ERROR: could not devise a query plan for the given query
>>
>> I tried that query again without the initsplan.c changes and somehow the
>> same error does not occur anymore. It's strange because without the
>> initsplan.c changes, there is no way for partitions lower in the tree than
>> the first level to get the direct_lateral_relids and lateral_relids from
>> the root parent rel. Maybe, Ashutosh has a way to devise the failing
>> query again.
>
> Thanks a lot for the reference. I devised a testcase slightly
> modifying my original test. I have included the test in the latest
> patch set.
>
> I have included Robert's changes to parts other than
> expand_inherited_rtentry() in the patch.
>
>> I also confirmed that the partition-pruning patch set works fine with this
>> patch instead of the patch on that thread with the same functionality,
>> which I will now drop from that patch set. Sorry about the wasted time.
>>
>
> Thanks a lot. Please review the patch in the updated patchset.

Some comments:

In create_lateral_join_info():

+ Assert(IS_SIMPLE_REL(brel));
+ Assert(brte);

The second Assert is either unnecessary or should be placed first.

The following comment could be made a bit clearer.

+ * In the case of table inheritance, the parent RTE is directly
linked
+ * to every child table via an AppendRelInfo. In the case of table
+ * partitioning, the inheritance hierarchy is expanded one level at a
+ * time rather than flattened. Therefore, an other member rel
that is
+ * a partitioned table may have children of its own, and must
+ * therefore be marked with the appropriate lateral info so that
those
+ * children eventually get marked also.

How about: In the case of partitioned table inheritance, the original
parent RTE is linked, via AppendRelInfo, only to its immediate partitions.
Partitions below the first level are accessible only via their immediate
parent's RelOptInfo, which would be of kind RELOPT_OTHER_MEMBER_REL, so
consider those as well.

In expand_inherited_rtentry(), the following comment fragment is obsolete,
because we *do* now create AppendRelInfo's for partitioned children:

+ /*
+ * We keep a list of objects in root, each of which maps a
partitioned
+ * parent RT index to the list of RT indexes of its partitioned child
+ * tables which do not have AppendRelInfos associated with those.

By the way, when we call expand_single_inheritance_child() in the
non-partitioned inheritance case, we should pass NULL for childrte_p,
childRTindex_p, childrc_p, instead of declaring variables that won't be
used. Hence, expand_single_inheritance_child() should make those
arguments optional.

+
+ /*
+ * If the partitioned table has no partitions or all the partitions are
+ * temporary tables from other backends, treat this as non-inheritance
+ * case.
+ */
+ if (!has_child)
+ parentrte->inh = false;

I guess the above applies to all partitioned tables in the tree, so, I
think we should update the comment in set_rel_size():

else if (rte->relkind == RELKIND_PARTITIONED_TABLE)
{
/*
* A partitioned table without leaf partitions is marked
* as a dummy rel.
*/
set_dummy_rel_pathlist(rel);
}

to say: a partitioned table without partitions is marked as a dummy rel.

Thanks,
Amit

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2017-09-11 07:04:05 Re: Partition-wise join for join between (declaratively) partitioned tables
Previous Message Bossart, Nathan 2017-09-11 05:05:43 Re: [Proposal] Allow users to specify multiple tables in VACUUM commands