Re: speeding up planning with partitions

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Imai Yoshikazu <yoshikazu_i443(at)live(dot)jp>, "jesper(dot)pedersen(at)redhat(dot)com" <jesper(dot)pedersen(at)redhat(dot)com>, "Imai, Yoshikazu" <imai(dot)yoshikazu(at)jp(dot)fujitsu(dot)com>, "'Amit Langote'" <amitlangote09(at)gmail(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: speeding up planning with partitions
Date: 2019-03-25 11:34:45
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Thanks a lot for reviewing.

On 2019/03/23 6:07, Tom Lane wrote:
> Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> writes:
>> [ v34 patch set ]
> I had a bit of a look through this. I went ahead and pushed 0008 and
> 0009, as they seem straightforward and independent of the rest. (BTW,
> 0009 makes 0003's dubious optimization in set_relation_partition_info
> quite unnecessary.) As for the rest:
> 0001: OK in principle, but I wonder why you implemented it by adding
> another recursive scan of the jointree rather than just iterating
> over the baserel array, as in make_one_rel() for instance. Seems
> like this way is more work, and more code that we'll have to touch
> if we ever change the jointree representation.

I've changed this to work by iterating over baserel array. I was mostly
worried about looping over potentially many elements of baserel array that
we simply end up skipping, but considering the other patch that delays
adding inheritance children, we don't need to worry about that.

> I also feel like you used a dartboard while deciding where to insert the
> call in query_planner(); dropping it into the middle of a sequence of
> equivalence-class-related operations seems quite random. Maybe we could
> delay that step all the way to just before make_one_rel, since the other
> stuff in between seems to only care about baserels? For example,
> it'd certainly be better if we could run remove_useless_joins before
> otherrel expansion, so that we needn't do otherrel expansion at all
> on a table that gets thrown away as being a useless join.

create_lateral_join_info() expects otherrels to be present to propagate
the information it creates.

I have moved add_other_rels_to_query() followed by
create_lateral_join_info() to just before make_one_rel().

> BTW, it strikes me that we could take advantage of the fact that
> baserels must all appear before otherrels in the rel array, by
> having loops over that array stop early if they're only interested
> in baserels. We could either save the index of the last baserel,
> or just extend the loop logic to fall out upon hitting an otherrel.
> Seems like this'd save some cycles when there are lots of partitions,
> although perhaps loops like that would be fast enough to not matter.

As Imai-san's testing shows, there's certainly a slight speedup to be
expected. If we want that, maybe we could use his patch.

> 0002: I seriously dislike this from a system-structure viewpoint.
> For one thing it makes root->processed_tlist rather moot, since it
> doesn't get set till after most of the work is done. (I don't know
> if there are any FDWs out there that look at processed_tlist, but
> they'd be broken by this if so. It'd be better to get rid of the
> field if we can, so that at least such breakage is obvious. Or,
> maybe, use root->processed_tlist as the communication field rather
> than having the tlist be a param to query_planner?)

Getting rid of processed_tlist altogether seems rather daunting to me, so
I've adopted your suggestion of adding any new junk TLEs to
processed_tlist directly.

> I also
> don't like the undocumented way that you've got build_base_rel_tlists
> working on something that's not the final tlist, and then going back
> and doing more work of the same sort later. I wonder whether we can
> postpone build_base_rel_tlists until after the other stuff happens,
> so that it can continue to do all of the tlist-distribution work.

Seeing your other reply to this part, I withdraw 0002 as previously proposed.

Instead, I modified 0003 to teach expand_inherited_rtentry() to add
"wholerow" junk TLE if any foreign table children need it, and a
"tableoid" junk TLE needed for the inheritance case. It also calls
add_vars_to_targetlist() directly to add those vars to parent's reltaget.
The newly added TLEs are added directly to processed_tlist.

"ctid" junk TLE is added by preprocess_targetlist() as before. Maybe, we
can remove the code in preprocess_targetlist() for adding "wholerow" and
"tableoid" junk TLEs, as it's essentially dead code after this patch.

> 0003: I haven't studied this in great detail, but it seems like there's
> some pretty random things in it, like this addition in
> inheritance_planner:
> + /* grouping_planner doesn't need to see the target children. */
> + subroot->append_rel_list = list_copy(orig_append_rel_list);
> That sure seems like a hack unrelated to the purpose of the patch ... and
> since list_copy is a shallow copy, I'm not even sure it's safe. Won't
> we be mutating the contained (and shared) AppendRelInfos as we go along?

Sorry, the comment wasn't very clear. As for how this is related to this
patch, we need subroot to have its own append_rel_list, because
query_planner() will add new entries to it if there are any inheritance
parents that are source relations. We wouldn't want them to be added to
the parent root's append_rel_list, because the next child will want to
start with same append_rel_list in its subroot as the first child.

As for why orig_append_rel_list and not parent_root->append_rel_list, the
latter contains appinfos for target child relations, which need not be
visible to query_planner(). In fact, all loops over append_rel_list
during query planning simply ignore those appinfos, because their parent
relation is not part of the translated query's join tree.

Indeed, we can do copyObject(orig_append_rel_list) and get rid of some
instances of code specific to subqueryRTindexes != NULL. The first block
of such code makes copies of appinfos that reference subquery RTEs, which
can simply be deleted because orig_append_rel_list contains only the
appinfos pertaining to flattened UNION ALL. The second block applies
ChangeVarNodes() to appinfos that reference subquery RTEs, necessitating
copying in the first place, which currently loops over
subroot->append_rel_list to filter out those that don't need to be
changed. If subroot->append_rel_list is a copy of orig_append_rel_list,
this filtering is unnecessary, so we can simply do:

ChangeVarNodes((Node *) subroot->append_rel_list, rti, newrti, 0);

I've made the above changes and updated the comment to reflect that.

> 0004 and 0005: I'm not exactly thrilled about loading more layers of
> overcomplication into inheritance_planner, especially not when the
> complication then extends out into new global planner flags with
> questionable semantics. We should be striving to get rid of that
> function, as previously discussed, and I fear this is just throwing
> more roadblocks in the way of doing so. Can we skip these and come
> back to the question after we replace inheritance_planner with
> expand-at-the-bottom?

As you know, since we're changing things so that source inheritance is
expanded in query_planner(), any UPDATE/DELETE queries containing
inheritance parents as source relations will regress to some degree,
because source inheritance expansion will be repeated for every child query.

I don't like the new flag contains_inherit_children either, because it
will be rendered unnecessary the day we get rid of inheritance_planner,
but I don't see any other way to get what we want.

If we're to forego 0004 because of that complexity, at least we should
consider 0005, because its changes are fairly local to
inheritance_planner(). The speedup and savings in memory consumption by
avoiding putting target child RTEs in the child query are significant, as
discussed upthread.

I have moved 0004 to be the last patch in the series to make way for other
simpler patches to be considered first.

> 0006: Seems mostly OK, but I'm not very happy about the additional
> table_open calls it's throwing into random places in the planner.
> Those can be rather expensive. Why aren't we collecting the appropriate
> info during the initial plancat.c consultation of the relcache?

Hmm, you're seeing those because we're continuing to use the old internal
interfaces of inherit.c. Especially, with the existing interfaces, we
need both the parent and child relations to be open to build the
AppendRelInfo. Note that we are using the same code for initializing both
target child relations and source child relations, and because we don't
have RelOptInfos available in the former case, we can't change the
internal interfaces to use RelOptInfos for passing information around.

Previous versions of this patch did add TupleDesc and Oid fields to
RelOptInfo to store a relation's rd_att and rd_rel->reltype, respectively,
that are needed to construct AppendRelInfo. It also performed massive
refactoring of the internal interfaces of inherit.c to work by passing
around parent RelOptInfo, also considering that one caller
(inheritance_planner) doesn't build RelOptInfos before calling. But I
thought all that refactoring would be a harder sell than adding a
table_open() call in joinrels.c to be able to call make_append_rel_info()
directly for building what's really a no-op AppendRelInfo.

> 0007: Not really sold on this; it seems like it's going to be a net
> negative for cases where there aren't a lot of partitions.

Performance loss for smaller number of partitions is in the noise range,
but what we gain for large number of partitions seems pretty significant
to me:

nparts no live_parts live_parts
====== ============= ==========
2 3397 3391
8 3365 3337
32 3316 3379
128 3338 3399
512 3273 3321
1024 3439 3517
4096 3113 3227
8192 2849 3215

Attached find updated patches.

0002 is a new patch to get rid of the duplicate RTE and PlanRowMark that's
created for partitioned parent table, as it's pointless. I was planning
to propose it later, but decided to post it now if we're modifying the
nearby code anyway.


Attachment Content-Type Size
v35-0001-Build-other-rels-of-appendrel-baserels-in-a-sepa.patch text/plain 12.0 KB
v35-0002-Get-rid-of-duplicate-child-RTE-for-partitioned-t.patch text/plain 1.3 KB
v35-0003-Delay-adding-inheritance-child-tables-until-quer.patch text/plain 40.4 KB
v35-0004-Perform-pruning-in-expand_partitioned_rtentry.patch text/plain 28.5 KB
v35-0005-Teach-planner-to-only-process-unpruned-partition.patch text/plain 7.3 KB
v35-0006-Fix-inheritance_planner-to-avoid-useless-work.patch text/plain 6.4 KB
v35-0007-Adjust-inheritance_planner-to-reuse-source-child.patch text/plain 23.9 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message legrand legrand 2019-03-25 11:36:48 RE: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
Previous Message Panagiotis Mavrogiorgos 2019-03-25 11:04:52 Feature: Add Greek language fulltext search