Re: speeding up planning with partitions

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, "Imai, Yoshikazu" <imai(dot)yoshikazu(at)jp(dot)fujitsu(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: speeding up planning with partitions
Date: 2018-11-14 10:28:32
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David, Imai-san,

Thanks for reviewing. I've included both replies in this email so that I
can attach the latest patch as well.

On 2018/11/10 20:59, David Rowley wrote:
> I've started looking at these two, but only so far made it through
> 0001 and 0002.
> Here's what I noted down during the review.
> 0001:
> 1. Why do we need the new field that this patch adds? I see in 0002
> it's used like:
> + if (childrel->inh_root_parent > 0 &&
> + childrel->inh_root_parent == root->parse->resultRelation)
> Would something like...
> int relid;
> if (childrel->part_schema == NULL &&
> bms_get_singleton_member(childrel->top_parent_relids, &relid) &&
> relid == root->parse->resultRelation)
> ...not do the trick?

Actually, that's one way and earlier patches relied on that, but it gets a
bit ugly given that it's not always the top_parent_relid we're looking for
in the partitioning/inheritance specific code. Consider this:

select *
from (select a from parted_table p union all
select a from inherited_table i) s
where s.a = 1;

top_parent_relids refers to the union all parent subquery 's RT index,
which makes the partitioning/inheritance code scramble through a chain of
parent relation relids to figure out the RT index of the table in the query.

So, inh_root_parent is useful to distinguish the inheritance root parent
from the Append relation root parent, without much code.

That said, in the latest version, I've modified the UPDATE/DELETE planning
patch such that it doesn't use inh_root_parent (or the new code doesn't
need to refer to root parent where it previously did), so I've moved the
patch that adds inh_root_parent to the 2nd in the series.

> 0002:
> 2. What's wrong with childrel->relids?
> + child_relids = bms_make_singleton(appinfo->child_relid);

I've modified the code to not have to use child_relids.

> 3. Why not use childrel->top_parent_relids?
> + top_parent_relids = bms_make_singleton(childrel->inh_root_parent);


> 4. The following comment in inheritance_make_rel_from_joinlist()
> implies that the function will not be called for SELECT, but the
> comment above the function does not mention that.
> /*
> * For UPDATE/DELETE queries, the top parent can only ever be a table.
> * As a contrast, it could be a UNION ALL subquery in the case of SELECT.
> */
> Assert(planner_rt_fetch(top_parent_relid, root)->rtekind == RTE_RELATION);

Fixed the function's header comment to be clear.

> 5. Should the following comment talk about "Sub-partitioned tables"
> rather than "sub-partitions"?
> + * Sub-partitions have to be processed recursively, because
> + * AppendRelInfos link sub-partitions to their immediate parents, not
> + * the root partitioned table.

Okay, done.

> 6. Can't you just pass childrel->relids and
> childrel->top_parent_relids instead of making new ones?
> + child_relids = bms_make_singleton(appinfo->child_relid);
> + Assert(childrel->inh_root_parent > 0);
> + top_parent_relids = bms_make_singleton(childrel->inh_root_parent);
> + translated_joinlist = (List *)
> + adjust_appendrel_attrs_multilevel(root,
> + (Node *) joinlist,
> + child_relids,
> + top_parent_relids);

The new code uses adjust_appendrel_attrs, so those Relids variables are
not needed.

> 7. I'm just wondering what your idea is here?
> + /* Reset join planning specific data structures. */
> + root->join_rel_list = NIL;
> + root->join_rel_hash = NULL;
> Is there a reason to nullify these? You're not releasing any memory
> and the new structures that will be built won't overlap with the ones
> built last round. I don't mean to imply that the code is wrong, it
> just does not appear to be particularly right.

In initial versions of the patch, the same top-level PlannerInfo was used
when calling make_rel_from_joinlist for all child tables. So,
join_rel_hash built for a given child would be assumed to be valid for the
next which isn't true, because joinrels would be different across children.

In the latest patch, inheritance_make_rel_from_joinlist uses different
PlannerInfos for different child target relations, so the above problem
doesn't exist. IOW, you won't see above two lines in the latest patch.

> 8. In regards to:
> + * NB: Do we need to change the child EC members to be marked
> + * as non-child somehow?
> + */
> + childrel->reloptkind = RELOPT_BASEREL;
> I know we talked a bit about this before, but this time I put together
> a crude patch that runs some code each time we skip an em_is_child ==
> true EquivalenceMember. The code checks if any of the em_relids are
> RELOPT_BASEREL. What I'm looking for here are places where we
> erroneously skip the member when we shouldn't. Running the regression
> tests with this patch in place shows a number of problems. Likely I
> should only trigger the warning when bms_membership(em->em_relids) ==
> BMS_SINGLETON, but it never-the-less appears to highlight various
> possible issues. Applying the same on master only appears to show the
> cases where em->em_relids isn't a singleton set. I've attached the
> patch to let you see what I mean.

Thanks for this. I've been thinking about what to do about it, but
haven't decided what's that yet. Please let me spend some more time
thinking on it. AFAICT, dealing with this will ensure that join planning
against target child relations can use EC-based optimizations, but it's
not incorrect as is per se.

On 2018/11/12 13:35, Imai, Yoshikazu wrote:
> adjust_appendrel_attrs_multilevel for leaf1: root -> sub1 -> leaf1
> adjust_appendrel_attrs_multilevel for leaf2: root -> sub1 -> leaf2

Ah, I see what you mean.

The root -> sub1 translation will be repeated for each leaf partition if
done via adjust_appendrel_attrs_multilevel. On the other hand, if we
could do the root to sub1 translation once and pass it to the recursive
call using sub1 as the parent.

I've changed the patch use adjust_appendrel_attrs.

> Since it is difficult to explain my thoughts with words, I will show the
> performance degration case.
> Partition tables are below two sets.

[ ... ]

> Create a generic plan of updation or deletion.
> [create a delete generic plan]
> set plan_cache_mode = 'force_generic_plan';
> prepare delete_stmt(int) as delete from rt where b = $1;
> execute delete_stmt(1);

[ ... ]

> How amount of memory is used with above tests is...
> without v5 patches, Set1: 242MB
> without v5 patches, Set2: 247MB
> with v5 patches, Set1: 420MB
> with v5 patches, Set2: 820MB

Although I didn't aim to fix planning for the generic plan case where no
pruning occurs, the above result is not acceptable. That is, the new
implementation of inheritance update/delete planning shouldn't consume
more memory than the previous. In fact, it should've consumed less,
because the patch claims that it gets rid of redundant processing per

I understood why update/delete planning consumed more memory with the
patch. It was due to a problem with the patch that modifies inheritance
update/delete planning. The exact problem was that the query tree would
be translated (hence copied) *twice* for every partition! First during
query planning where the query tree would be translated to figure out a
targetlist for partitions and then again before calling grouping_planner.
Also, the adjust_appendrel_attrs_multilevel made it worse for multi-level
partitioning case, because of repeated copying for root to intermediate
partitioned tables, as Imai-san pointed out.

I've fixed that making sure that query tree is translated only once and
saved for later steps to use. Imai-san, please check the memory
consumption with the latest patch.

Attached updated patches. Significant revisions are as follows (note that
I reversed the order of 0001 and 0002):


The major changes is fixing the problem above.


No change.


Made inheritance expansion a separate step of make_one_rel whereas
previously it would be invoked at the beginning of set_append_rel_size.
Now, it runs just before set_base_rel_sizes. The same step also
recursively expands (and performs pruning for) any child partitioned
tables that were added by the expansion of partitioned tables originally
mentioned in the query. With this change, we don't need to worry about
the range table changing as set_base_rel_size is executing, which could
lead to problems.


No change.


Attachment Content-Type Size
v6-0001-Overhaul-inheritance-update-delete-planning.patch text/plain 51.5 KB
v6-0002-Store-inheritance-root-parent-index-in-otherrel-s.patch text/plain 2.5 KB
v6-0003-Lazy-creation-of-RTEs-for-inheritance-children.patch text/plain 85.4 KB
v6-0004-Move-append-expansion-code-into-its-own-file.patch text/plain 112.3 KB
v6-0005-Teach-planner-to-only-process-unpruned-partitions.patch text/plain 7.2 KB
v6-0006-Do-not-lock-all-partitions-at-the-beginning.patch text/plain 1.2 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Surafel Temesgen 2018-11-14 10:48:36 Re: [Todo item] Add entry creation timestamp column to pg_stat_replication
Previous Message Pavel Stehule 2018-11-14 10:26:21 Re: proposal: simple query profile and tracing API