Re: speeding up planning with partitions

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Imai Yoshikazu <yoshikazu_i443(at)live(dot)jp>, "jesper(dot)pedersen(at)redhat(dot)com" <jesper(dot)pedersen(at)redhat(dot)com>, "Imai, Yoshikazu" <imai(dot)yoshikazu(at)jp(dot)fujitsu(dot)com>, Amit Langote <amitlangote09(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: speeding up planning with partitions
Date: 2019-03-29 03:44:44
Message-ID: 5c83dbca-12b5-1acf-0e85-58299e464a26@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks a lot for hacking on the patch. I'm really happy with the
direction you took for inheritance_planner, as it allows UPDATE/DELETE to
use partition pruning.

On 2019/03/29 7:38, Tom Lane wrote:
> I've been hacking on this pretty hard over the last couple days,
> because I really didn't like the contortions you'd made to allow
> inheritance_planner to call expand_inherited_rtentry in a completely
> different context than the regular code path did. I eventually got
> rid of that

Good riddance.

> by having inheritance_planner run one cycle of planning
> the query as if it were a SELECT, and extracting the list of unpruned
> children from that.

This is somewhat like my earlier patch that we decided to not to pursue,
minus all the hackery within query_planner() that was in that patch, which
is great.

(I can't find the link, but David Rowley had posted a patch for allowing
UPDATE/DELETE to use partition pruning in the late stages of PG 11
development, which had taken a similar approach.)

> I had to rearrange the generation of the final
> rtable a bit to make that work, but I think that inheritance_planner
> winds up somewhat cleaner and safer this way; it's making (slightly)
> fewer assumptions about how much the results of planning the child
> queries can vary.
>
> Perhaps somebody will object that that's one more planning pass than
> we had before, but I'm not very concerned, because
> (a) at least for partitioned tables that we can prune successfully,
> this should still be better than v11, since we avoid the planning
> passes for pruned children.

Certainly. Note that previously we'd always scan *all* hash partitions
for UPDATE and DELETE queries, because constraint exclusion can't exclude
hash partitions due to the shape of their partition constraint.

I ran my usual benchmark with up to 8192 partitions.

N: 2..8192

create table rt (a int, b int, c int) partition by range (a);
select 'create table rt' || x::text || ' partition of rt for values from
(' || (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1,
N) x;
\gexec

update.sql:

\set param random(1, N)
update rt set a = 0 where a = :param;

pgbench -n -T 120 -f select.sql

nparts v38 HEAD
====== ==== ====
2 2971 2969
8 2980 1949
32 2955 733
128 2946 145
512 2924 11
1024 2986 3
4096 2702 0
8192 2531 OOM

Obviously, you'll get similar numbers with hash or list partitioning.

> (b) inheritance_planner is horridly inefficient anyway, in that it
> has to run a near-duplicate planning pass for each child table.
> If we're concerned about its cost, we should be working to get rid of
> the function altogether, as per [1]. In the meantime, I do not want
> to contort other code to make life easier for inheritance_planner.

Agreed.

> There's still some loose ends:
>
> 1. I don't like 0003 much, and omitted it from the attached.
> I think that what we ought to be doing instead is not having holes
> in the rel_parts[] arrays to begin with, ie they should only include
> the unpruned partitions. If we are actually encoding any important
> information in those array positions, I suspect that is broken anyway
> in view of 898e5e329: we can't assume that the association of child
> rels with particular PartitionDesc slots will hold still from planning
> to execution.

It's useful for part_rels array to be indexed in the same way as
PartitionDesc. Firstly, because partition pruning code returns the
PartitionDesc-defined indexes of unpruned partitions. Second,
partitionwise join code decides two partitioned tables as being compatible
for partitionwise joining, then it must join partitions that have
identical *PartitionDesc* indexes, which is what it does by part_rels
arrays of both sides in one loop.

Regarding the impact of 898e5e329 on this, I think it invented
PartitionDirectory exactly to avoid PartitionDesc changing under us
affecting the planning or execution of a given query. As for
PartitionDesc indexes being different between planning and execution, it
only affects PartitionPruneInfo and the commit did make changes to
ExecCreatePartitionPruningState to remap the old indexes of unpruned
partitions in PartitionPruneInfo (as they were during planning) to the new
ones.

> 2. I seriously dislike what's been done in joinrels.c, too. That
> really seems like a kluge (and I haven't had time to study it
> closely).

Those hunks account for the fact that pruned partitions, for which we no
longer create RangeTblEntry and RelOptInfo, may appear on the nullable
side of an outer join. We'll need a RelOptInfo holding a dummy path, so
that outer join paths can be created with one side of join being dummy
result path, which are built in the patch by build_dummy_partition_rel().

> 3. It's not entirely clear to me why the patch has to touch
> execPartition.c. That implies that the planner-to-executor API
> changed, but how so, and why is there no comment update clarifying it?

The change is that make_partitionedrel_pruneinfo() no longer adds the OIDs
of pruned partitions to the PartitionPruneInfo.relid_map array, a field
which 898e5e329 added. 898e5e329 had also added an Assert in
ExecCreatePartitionPruneState, which you're seeing is being changed in the
patch. In the case that partition count hasn't changed between planning
and execution (*), it asserts that partition OIDs in
PartitionPruneInfo.relid_map, which is essentially an exact copy of the
partition OIDs in the PartitionDesc that planner retrieved, are same as
the OIDs in the PartitionDesc that executor retrieves.

* this seems a bit shaky, because partition count not having changed
doesn't discount the possibility that partitions themselves haven't changed

> Given the short amount of time left in this CF, there may not be
> time to address the first two points, and I won't necessarily
> insist that those be changed before committing. I'd like at least
> a comment about point 3 though.
>
> Attached is updated patch as a single patch --- I didn't think the
> division into multiple patches was terribly helpful, due to the
> flapping in expected regression results.

Thanks again for the new patch. I'm reading it now and will send comments
later today if I find something.

Thanks,
Amit

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jamison, Kirk 2019-03-29 04:22:16 RE: Timeout parameters
Previous Message Tom Lane 2019-03-29 03:42:35 Re: Syntax diagrams in user documentation