speeding up planning with partitions

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: speeding up planning with partitions
Date: 2018-08-29 12:06:16
Message-ID: 9d7c5112-cb99-6a47-d3be-cf1ee6862a1d@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

It is more or less well known that the planner doesn't perform well with
more than a few hundred partitions even when only a handful of partitions
are ultimately included in the plan. Situation has improved a bit in PG
11 where we replaced the older method of pruning partitions one-by-one
using constraint exclusion with a much faster method that finds relevant
partitions by using partitioning metadata. However, we could only use it
for SELECT queries, because UPDATE/DELETE are handled by a completely
different code path, whose structure doesn't allow it to call the new
pruning module's functionality. Actually, not being able to use the new
pruning is not the only problem for UPDATE/DELETE, more on which further
below.

While situation improved with new pruning where it could, there are still
overheads in the way planner handles partitions. As things stand today,
it will spend cycles and allocate memory for partitions even before
pruning is performed, meaning most of that effort could be for partitions
that were better left untouched. Currently, planner will lock, heap_open
*all* partitions, create range table entries and AppendRelInfos for them,
and finally initialize RelOptInfos for them, even touching the disk file
of each partition in the process, in an earlier phase of planning. All of
that processing is vain for partitions that are pruned, because they won't
be included in the final plan. This problem grows worse as the number of
partitions grows beyond thousands, because range table grows too big.

That could be fixed by delaying all that per-partition activity to a point
where pruning has already been performed, so that we know the partitions
to open and create planning data structures for, such as somewhere
downstream to query_planner. But before we can do that we must do
something about the fact that UPDATE/DELETE won't be able to cope with
that because the code path that currently handles the planning of
UPDATE/DELETE on partitioned tables (inheritance_planner called from
subquery_planner) relies on AppendRelInfos for all partitions having been
initialized by an earlier planning phase. Delaying it to query_planner
would be too late, because inheritance_planner calls query_planner for
each partition, not for the parent. That is, if query_planner, which is
downstream to inheritance_planner, was in the charge of determining which
partitions to open, the latter wouldn't know which partitions to call the
former for. :)

That would be fixed if there is no longer this ordering dependency, which
is what I propose to do with the attached patch 0001. I've tried to
describe how the patch manages to do that in its commit message, but I'll
summarize here. As things stand today, inheritance_planner modifies the
query for each leaf partition to make the partition act as the query's
result relation instead of the original partitioned table and calls
grouping_planner on the query. That means anything that's joined to
partitioned table looks to instead be joined to the partition and join
paths are generated likewise. Also, the resulting path's targetlist is
adjusted to be suitable for the result partition. Upon studying how this
works, I concluded that the same result can be achieved if we call
grouping_planner only once and repeat the portions of query_planner's and
grouping_planner's processing that generate the join paths and appropriate
target list, respectively, for each partition. That way, we can rely on
query_planner determining result partitions for us, which in turn relies
on the faster partprune.c based method of pruning. That speeds things up
in two ways. Faster pruning and we no longer repeat common processing for
each partition.

With 0001 in place, there is nothing that requires that partitions be
opened by an earlier planning phase, so, I propose patch 0002, which
refactors the opening and creation of planner data structures for
partitions such that it is now performed after pruning. However, it
doesn't do anything about the fact that partitions are all still locked in
the earlier phase.

With various overheads gone thanks to 0001 and 0002, locking of all
partitions via find_all_inheritos can be seen as the single largest
bottleneck, which 0003 tries to address. I've kept it a separate patch,
because I'll need to think a bit more to say that it's actually to safe to
defer locking to late planning, due mainly to the concern about the change
in the order of locking from the current method. I'm attaching it here,
because I also want to show the performance improvement we can expect with it.

I measured the gain in performance due to each patch on a modest virtual
machine. Details of the measurement and results follow.

* Benchmark scripts

update.sql
update ht set a = 0 where b = 1;

select.sql
select * from ht where b = 1;

* Table:

create table ht (a int, b int) partition by hash (b)
create table ht_1 partition of ht for values with (modulus N, remainder 0)
..
create table ht_N partition of ht for values with (modulus N, remainder N-1)

* Rounded tps with update.sql and select.sql against regular table (nparts
= 0) and partitioned table with various partition counts:

pgbench -n -T 60 -f update.sql

nparts master 0001 0002 0003
====== ====== ==== ==== ====
0 2856 2893 2862 2816
8 507 1115 1447 1872
16 260 765 1173 1892
32 119 483 922 1884
64 59 282 615 1881
128 29 153 378 1835
256 14 79 210 1803
512 5 40 113 1728
1024 2 17 57 1616
2048 0* 9 30 1471
4096 0+ 4 15 1236
8192 0= 2 7 975

* 0.46
+ 0.0064
= 0 (OOM on a virtual machine with 4GB RAM)

As can be seen here, 0001 is a big help for update queries.

pgbench -n -T 60 -f select.sql

For a select query that doesn't contain join and needs to scan only one
partition:

nparts master 0001 0002 0003
====== ====== ==== ==== ====
0 2290 2329 2319 2268
8 1058 1077 1414 1788
16 711 729 1124 1789
32 450 475 879 1773
64 265 272 603 1765
128 146 149 371 1685
256 76 77 214 1678
512 39 39 112 1636
1024 16 17 59 1525
2048 8 9 29 1416
4096 4 4 15 1195
8192 2 2 7 932

Actually, here we get almost same numbers with 0001 as with master,
because 0001 changes nothing for SELECT queries. We start seeing
improvement with 0002, the patch to delay opening partitions.

Thanks,
Amit

Attachment Content-Type Size
0001-Overhaul-partitioned-table-update-delete-planning.patch text/plain 26.7 KB
0002-Lazy-creation-of-partition-objects-for-planning.patch text/plain 46.6 KB
0003-Only-lock-partitions-that-will-be-scanned-by-a-query.patch text/plain 3.3 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Yugo Nagata 2018-08-29 12:09:03 Re: pg_verify_checksums -d option (was: Re: pg_verify_checksums -r option)
Previous Message Magnus Hagander 2018-08-29 11:47:47 Re: pg_verify_checksums vs windows