Re: Partition-wise join for join between (declaratively) partitioned tables

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Partition-wise join for join between (declaratively) partitioned tables
Date: 2017-09-11 10:45:44
Message-ID: CAFjFpRfJ3GRRmmOugaMA-q4i=se5P6yjZ_C6A6HDRDQQTGXy1A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Sep 11, 2017 at 12:16 PM, Amit Langote
<Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> On 2017/09/09 2:38, Ashutosh Bapat wrote:
>> On Fri, Sep 8, 2017 at 11:17 AM, Amit Langote wrote:
>>> I updated the patch to include just those changes. I'm not sure about
>>> one of the Ashutosh's changes whereby the child PlanRowMark is also passed
>>> to expand_partitioned_rtentry() to use as the parent PlanRowMark. I think
>>> the child RTE, child RT index and child Relation are fine, because they
>>> are necessary for creating AppendRelInfos in a desired way for later
>>> planning steps. But PlanRowMarks are not processed within the planner
>>> afterwards and do not need to be marked with the immediate parent-child
>>> association in the same way that AppendRelInfos need to be.
>>
>> Passing top parent's row mark works today, since there is no
>> parent-child specific translation happening there. But if in future,
>> we introduce such a translation, row marks for indirect children in a
>> multi-level partitioned hierarchy won't be accurate. So, I think it's
>> better to pass row marks of the direct parent.
>
> IMHO, we should make it the responsibility of the future patch to set a
> child PlanRowMark's prti to the direct parent's RT index, when we actually
> know that it's needed for something. We clearly know today why we need to
> pass the other objects like child RT entry, RT index, and Relation, so we
> should limit this patch to pass only those objects to the recursive call.
> That makes this patch a relatively easy to understand change.

I think you are mixing two issues here 1. setting parent RTI in child
PlanRowMark and 2. passing immediate parent's PlanRowMark to
expand_single_inheritance_child().

I have discussed 1 in my reply to Robert.

About 2 you haven't given any particular comments to my reply. To me
it looks like it's this patch that introduces the notion of
multi-level expansion, so it's natural for this patch to pass
PlanRowMark in cascaded fashion similar to other structures.

>
>>> I also included the changes to add_paths_to_append_rel() from my patch on
>>> the "path toward faster partition pruning" thread. We'd need that change,
>>> because while add_paths_to_append_rel() is called for all partitioned
>>> table RTEs in a given partition tree, expand_inherited_rtentry() would
>>> have set up a PartitionedChildRelInfo only for the root parent, so
>>> get_partitioned_child_rels() would not find the same for non-root
>>> partitioned table rels and crash failing the Assert. The changes I made
>>> are such that we call get_partitioned_child_rels() only for the parent
>>> rels that are known to correspond root partitioned tables (or as you
>>> pointed out on the thread, "the table named in the query" as opposed those
>>> added to the query as result of inheritance expansion). In addition to
>>> the relkind check on the input RTE, it would seem that checking that the
>>> reloptkind is RELOPT_BASEREL would be enough. But actually not, because
>>> if a partitioned table is accessed in a UNION ALL query, reloptkind even
>>> for the root partitioned table (the table named in the query) would be
>>> RELOPT_OTHER_MEMBER_REL. The only way to confirm that the input rel is
>>> actually the root partitioned table is to check whether its parent rel is
>>> not RTE_RELATION, because the parent in case of UNION ALL Append is a
>>> RTE_SUBQUERY RT entry.
>>>
>>
>> There was a change in my 0003 patch, which fixed the crash. It checked
>> for RELOPT_BASEREL and RELKIND_PARTITIONED_TABLE. I have pulled it in
>> my 0001 patch. It no more crashes. I tried various queries involving
>> set operations and bare multi-level partitioned table scan with my
>> patch, but none of them showed any anomaly. Do you have a testcase
>> which shows problem with my patch? May be your suggestion is correct,
>> but corresponding code implementation is slightly longer than I would
>> expect. So, we should go with it, if there is corresponding testcase
>> which shows why it's needed.
>
> If we go with your patch, partitioned tables won't get locked, for
> example, in case of the following query (p is a partitioned table):
>
> select 1 from p union all select 2 from p;
>
> That's because the RelOptInfos for the two instances of p in the above
> query are RELOPT_OTHER_MEMBER_REL, not RELOPT_BASEREL. They are children
> of the Append corresponding to the UNION ALL subquery RTE. So,
> partitioned_rels does not get set per your proposed code.

Session 1:
postgres=# begin;
BEGIN
postgres=# select 1 from t1 union all select 2 from t1;
?column?
----------
(0 rows)

postgres=# select pg_backend_pid();
pg_backend_pid
----------------
28843
(1 row)

Session 2
postgres=# select locktype, relation::regclass, virtualxid,
virtualtransaction, pid, mode, granted, fastpath from pg_locks;
locktype | relation | virtualxid | virtualtransaction | pid |
mode | granted | fastpath
------------+----------+------------+--------------------+-------+-----------------+---------+----------
relation | pg_locks | | 4/14 | 28854 |
AccessShareLock | t | t
virtualxid | | 4/14 | 4/14 | 28854 |
ExclusiveLock | t | t
relation | t1p1p1 | | 3/9 | 28843 |
AccessShareLock | t | t
relation | t1p1 | | 3/9 | 28843 |
AccessShareLock | t | t
relation | t1 | | 3/9 | 28843 |
AccessShareLock | t | t
virtualxid | | 3/9 | 3/9 | 28843 |
ExclusiveLock | t | t
(6 rows)

So, all partitioned partitions are getting locked correctly. Am I
missing something?

>
>>
>> In your patch
>>
>> + parent_rel = root->simple_rel_array[parent_relid];
>> + get_pcinfo = (parent_rel->rtekind == RTE_SUBQUERY);
>>
>> Do you mean RTE_RELATION as you explained above?
>
> No, I mean RTE_SUBQUERY.
>
> If the partitioned table RTE in question corresponds to one named in the
> query, we should be able to find its pcinfo in root->pcinfo_list. If the
> partitioned table RTE is one added as result of inheritance expansion, it
> won't have an associated pcinfo. So, we should find a way to distinguish
> them from one another. The first idea that had occurred to me was the
> same as yours -- RelOptInfo of the partitioned table RTE named in the
> query would be of reloptkind RELOPT_BASEREL and those of the partitioned
> table RTE added as result of inheritance expansion will be of reloptkind
> RELOPT_OTHER_MEMBER_REL. Although the latter is always true, the former
> is not. If the partitioned table named in the query appears under UNION
> ALL query, then its reloptkind will be RELOPT_OTHER_MEMBER_REL. That
> means we have to use some other means to distinguish partitioned table
> RTEs that have an associated pcinfo from those that don't. So, I devised
> this method of looking at the parent RTE (if any) for distinguishing the
> two. Partitioned table named in the query either doesn't have the parent
> or if it does, the parent could only ever be a UNION ALL subquery
> (RTE_SUBQUERY). Partitioned tables added as part of inheritance expansion
> will always have the parent and the parent will only ever be a table
> (RTE_RELATION).
>

Actually, the original problem that caused this discussion started
with an assertion failure in get_partitioned_child_rels() as
Assert(list_length(result) >= 1);

This assertion fails if result is NIL when an intermediate partitioned
table is passed. May be we should assert (result == NIL ||
list_length(result) == 1) and allow that function to be called even
for intermediate partitioned partitions for which the function will
return NIL. That will leave the code in add_paths_to_append_rel()
simple. Thoughts?

>
> In create_lateral_join_info():
>
> + Assert(IS_SIMPLE_REL(brel));
> + Assert(brte);
>
> The second Assert is either unnecessary or should be placed first.

simple_rte_array[] may have some NULL entries. Second assert makes
sure that we aren't dealing with a NULL entry. Any particular reason
to reorder the asserts?

>
> The following comment could be made a bit clearer.
>
> + * In the case of table inheritance, the parent RTE is directly
> linked
> + * to every child table via an AppendRelInfo. In the case of table
> + * partitioning, the inheritance hierarchy is expanded one level at a
> + * time rather than flattened. Therefore, an other member rel
> that is
> + * a partitioned table may have children of its own, and must
> + * therefore be marked with the appropriate lateral info so that
> those
> + * children eventually get marked also.
>
> How about: In the case of partitioned table inheritance, the original
> parent RTE is linked, via AppendRelInfo, only to its immediate partitions.
> Partitions below the first level are accessible only via their immediate
> parent's RelOptInfo, which would be of kind RELOPT_OTHER_MEMBER_REL, so
> consider those as well.

I don't see much difference between those two. We usually do not use
macros in comments, so usually comments mention "other member" rel.
Let's leave this for the committer to judge.

>
> In expand_inherited_rtentry(), the following comment fragment is obsolete,
> because we *do* now create AppendRelInfo's for partitioned children:
>
> + /*
> + * We keep a list of objects in root, each of which maps a
> partitioned
> + * parent RT index to the list of RT indexes of its partitioned child
> + * tables which do not have AppendRelInfos associated with those.

Good catch. I have reworded it as
/*
* We keep a list of objects in root, each of which maps a root
* partitioned parent RT index to the list of RT indexes of descendant
* partitioned child tables.

Does that look good?

>
>
> By the way, when we call expand_single_inheritance_child() in the
> non-partitioned inheritance case, we should pass NULL for childrte_p,
> childRTindex_p, childrc_p, instead of declaring variables that won't be
> used. Hence, expand_single_inheritance_child() should make those
> arguments optional.

That introduces an extra "if" condition, which is costlier than an
assignment. We have used both the styles in the code. Previously, I
have got comments otherwise. So, I am not sure.

>
> +
> + /*
> + * If the partitioned table has no partitions or all the partitions are
> + * temporary tables from other backends, treat this as non-inheritance
> + * case.
> + */
> + if (!has_child)
> + parentrte->inh = false;
>
> I guess the above applies to all partitioned tables in the tree, so, I
> think we should update the comment in set_rel_size():
>
> else if (rte->relkind == RELKIND_PARTITIONED_TABLE)
> {
> /*
> * A partitioned table without leaf partitions is marked
> * as a dummy rel.
> */
> set_dummy_rel_pathlist(rel);
> }
>
> to say: a partitioned table without partitions is marked as a dummy rel.

Done. Thanks again for the catch.

I will update the patches once we have some resolution about 1. prti
in PlanRowMarks and 2. detection of root partitioned table in
add_paths_to_append_rel().

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Bapat 2017-09-11 10:51:37 Re: Partition-wise join for join between (declaratively) partitioned tables
Previous Message Tomas Vondra 2017-09-11 10:37:22 Re: Polyphase merge is obsolete