Re: Small performance tweak to run-time partition pruning

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Imai, Yoshikazu" <imai(dot)yoshikazu(at)jp(dot)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Small performance tweak to run-time partition pruning
Date: 2018-11-21 14:13:36
Message-ID: CAKJS1f8ZnAW9VJNpJW16t5CtXSq3eAseyJXdumLaYb8DiTbhXA@mail.gmail.com
Lists: pgsql-hackers

On Fri, 16 Nov 2018 at 07:50, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I pushed this with a bit of extra tweaking --- notably, I added an
> Assert to ExecFindMatchingSubPlans to guard against the possibility
> that someone calls it when the remapping hasn't been done.

Thanks for pushing.

> I am not really convinced that the original idea of replacing the
> loop-over-subplans with a bms_next_member() loop is a good one.
> Yeah, it wins in the best case where there are a lot of subplans
> and initial pruning got rid of nearly all of them --- but how often
> is that true, given the context that both do_initial_prune and
> do_exec_prune are needed? And does it still win if most of the
> subplans *didn't* get pruned here? bms_next_member() is not cheap
> compared to an increment-and-compare. I am distressed by the fact
> that your performance testing seems to have examined only this
> best case. IMO performance testing should usually worry more about
> the possible downside in the worst case.

You are correct that a for loop up to N over a Bitmapset with all
members set, testing bms_is_member() on each index, is faster than a
while loop over bms_next_member(). However, I don't think the case
where many partitions remain after the initial prune is an interesting
one to focus on. The additional overhead of initializing plans for a
large number of partitions is most likely to more than drown out
anything we can do to improve the performance of
ExecFindInitialMatchingSubPlans(). As an example, I tried on master
with:

create table listp (b bool, z int) partition by list(b);
create table listp_false partition of listp for values in(false);
create table listp_true partition of listp for values in(true)
partition by list (z);
select 'create table listp_true' || x::text || ' partition of listp_true
for values in(' || x::text || ');' from generate_series(1,1000) x;
\gexec

Having changed postgresql.conf:

plan_cache_mode = force_generic_plan
max_parallel_workers_per_gather = 0

I ran this through pgbench:

\set p_b true
prepare q1 (bool) as select * from listp where b = :p_b and z = (select 1);

tps = 230.317940 (excluding connections establishing)

With my original test case, I reported about a 4-microsecond-per-query
performance improvement from this patch. That does not amount to very
much in a query that takes well over 4000 microseconds to complete (at
230 tps, each execution takes roughly 1/230 of a second, or about 4300
microseconds). I've not benchmarked with the patch reverted, as I
don't think it would be easy to measure something that accounts for
only about 0.1% of the total query time.

If we could have pruned all those other partitions during executor
startup, i.e. with:

\set p_b true
\set p_z 1
select * from listp where b = :p_b and z = :p_z

then we'd get something more like:

tps = 1952.976045 (excluding connections establishing)

The saving here comes from not having to init/shutdown nearly as many
scan nodes. Any additional overhead in
ExecFindInitialMatchingSubPlans() will be noticed much more in a fast
query such as this one, and in this case, where all but one partition
was pruned, the bms_next_member() loop is significantly faster.
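
For reference, the two loop shapes being compared look roughly like
the following (an illustrative sketch only, not code from the tree;
process_subplan(), nsubplans and validsubplans are placeholder names):

#include "postgres.h"
#include "nodes/bitmapset.h"

static void process_subplan(int i);	/* hypothetical per-subplan work */

/*
 * Increment-and-compare: always performs nsubplans iterations, no
 * matter how many subplans actually survived pruning.
 */
static void
scan_all_indexes(Bitmapset *validsubplans, int nsubplans)
{
	int		i;

	for (i = 0; i < nsubplans; i++)
	{
		if (bms_is_member(i, validsubplans))
			process_subplan(i);
	}
}

/*
 * Walk only the set members: the cost scales with the number of
 * survivors, so this wins when pruning has removed nearly everything.
 * bms_next_member() returns -2 once the set is exhausted.
 */
static void
scan_set_members(Bitmapset *validsubplans)
{
	int		i = -1;

	while ((i = bms_next_member(validsubplans, i)) >= 0)
		process_subplan(i);
}

The first shape pays a fixed cost proportional to the total number of
subplans; the second pays per surviving subplan, plus the higher
per-step cost of bms_next_member(), which is why it only wins when few
subplans remain.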

> Rather than arguing about that, though, I think we should focus
> on something else: I'd be a lot happier if this remapping logic
> just disappeared completely. I doubt it's a performance issue
> now that we run it only when initial and exec pruning are both
> needed. But it seems like a source of complexity and bugs, most
> especially the latter now that common cases won't run it. Why
> is it that we can't fix things so that ExecFindMatchingSubPlans
> just uses the original planner output data?

I understand you'd rather not have this code, but I believe it's worth
doing things this way for the added performance improvements we get
from not having to allocate a large, mostly empty array to store the
initialized subnodes. I wrote a quick patch to get rid of the re-map
code and just palloc0 an array for all subnodes, initializing only the
ones that were not pruned. Testing both master and this no-re-map
version with a partitioned table with 10k hash partitions, I get:

$ pgbench -n -f bench.sql -T 60 -M prepared postgres

-- Don't re-map
tps = 95.459176 (excluding connections establishing)
tps = 96.320146 (excluding connections establishing)

-- master
tps = 102.019177 (excluding connections establishing)
tps = 104.052788 (excluding connections establishing)

So there's a non-zero performance penalty if we remove the re-map code.
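
To make the trade-off concrete, the difference is roughly the sketch
below (names are invented for illustration, with malloc/calloc standing
in for the backend's palloc/palloc0):

#include <stdlib.h>

typedef struct PlanState PlanState;	/* stand-in for the executor type */

/*
 * With re-mapping: size the array to just the subplans that survived
 * initial pruning; the subplan indexes held in the pruning state are
 * then re-mapped to match the compacted array positions.
 */
static PlanState **
alloc_compact(int n_surviving)
{
	return (PlanState **) malloc(sizeof(PlanState *) * n_surviving);
}

/*
 * Without re-mapping: the array must span every planner subplan so
 * that the original indexes stay valid, with pruned slots left NULL.
 * For 10k partitions that means zeroing roughly 80kB of pointers on
 * every execution, which is the cost visible in the numbers above.
 */
static PlanState **
alloc_sparse(int n_total)
{
	return (PlanState **) calloc(n_total, sizeof(PlanState *));
}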

As far as I see it, we've still got lots to improve in the executor to
speed this up further: we populate the large 10k-element range table
array in ExecInitRangeTable(); we run ExecCheckRTPerms() on a range
table that requires no permission checks for any of the partitions;
and we lock every partition in AcquireExecutorLocks() instead of
delaying the locking of partitions until after pruning takes place. I
have WIP patches locally to resolve each of these, which take
performance to about 900 tps. At that point, the palloc0() on the
almost entirely empty Append subnode array becomes a much larger
percentage of the total query time.

Perhaps a compromise would be to move the re-map code into
partprune.c so as to keep all of that logic in one place.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
