RE: speeding up planning with partitions

From: "Kato, Sho" <kato-sho(at)jp(dot)fujitsu(dot)com>
To: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: RE: speeding up planning with partitions
Date: 2018-08-31 10:48:55
Message-ID: 25C1C6B2E7BE044889E4FE8643A58BA963AF581E@G01JPEXMBKW03
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi, Amit

Great!
With the total number of records being 6400, I benchmarked while increasing the number of partitions from 100 to 6400.
Applying three all patches, 20% performance improved for 100 partitions.

I have the same number of records for each partition, do you do the same?

Also, in my case, performance was better when not prepare.
I think these patches do not improve execute case, so we need faster runtime pruning patch[1], right?

Details of measurement conditions and results are as follows.

- base source
master(@777e6ddf17) + Speeding up Insert v8 patch[1]

- table definition(e.g. 100 partition)
create table test.accounts(aid serial, abalance int) partition by range(aid);
create table test.accounts_history(id serial, aid int, delta int, mtime timestamp without time zone) partition by range(aid);

create table test.account_part_1 partition of test.accounts for values from (1) to (65);
create table test.account_part_2 partition of test.accounts for values from (65) to (129);
...
create table test.account_part_100 partition of test.accounts for values from (6337) to (6400);

create table test.ah_part_1 partition of test.accounts_history for values from (1) to (65);
create table test.ah_part_2 partition of test.accounts_history for values from (65) to (129);
...
create table test.ah_part_100 partition of test.accounts_history for values from (6337) to (6400);

- benchmark script
\set aid random(1, 6400)
\set delta random(-5000, 5000)
BEGIN;
UPDATE test.accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM test.accounts WHERE aid = :aid;
INSERT INTO test.accounts_history (aid, delta, mtime) VALUES (:aid, :delta, CURRENT_TIMESTAMP);
END;

- command option
pgbench -d testdb -f benchmark.pgbench -T 180 -r -n -M prepare
pgbench -d testdb -f benchmark.pgbench -T 180 -r -n

-results
base source no prepared
part_num | tps_ex | update_latency | select_latency | insert_latency
----------+------------+----------------+----------------+----------------
100 | 662.414805 | 0.357 | 0.265 | 0.421
200 | 494.478431 | 0.439 | 0.349 | 0.579
400 | 307.982089 | 0.651 | 0.558 | 0.927
800 | 191.360676 | 0.979 | 0.876 | 1.548
1600 | 75.344947 | 2.253 | 2.003 | 4.301
3200 | 30.643902 | 5.716 | 4.955 | 10.118
6400 | 16.726056 | 12.512 | 8.582 | 18.054

0001 no prepared
part_num | tps_ex | update_latency | select_latency | insert_latency
----------+------------+----------------+----------------+----------------
100 | 429.816329 | 0.811 | 0.75 | 0.365
200 | 275.211531 | 1.333 | 1.248 | 0.501
400 | 160.499833 | 2.384 | 2.252 | 0.754
800 | 79.387776 | 4.935 | 4.698 | 1.468
1600 | 24.787377 | 16.593 | 15.954 | 4.302
3200 | 9.846421 | 42.96 | 42.139 | 8.848
6400 | 4.919772 | 87.43 | 83.109 | 16.56

0001 prepared
part_num | tps_ex | update_latency | select_latency | insert_latency
----------+------------+----------------+----------------+----------------
100 | 245.100776 | 2.728 | 0.374 | 0.476
200 | 140.249283 | 5.116 | 0.603 | 0.686
400 | 67.608559 | 11.383 | 1.055 | 1.179
800 | 23.085806 | 35.781 | 2.585 | 2.677
1600 | 6.211247 | 141.093 | 7.096 | 6.785
3200 | 1.808214 | 508.045 | 15.741 | 13.243
6400 | 0.495635 | 1919.362 | 37.691 | 28.177

0001 + 0002 no prepared
part_num | tps_ex | update_latency | select_latency | insert_latency
----------+------------+----------------+----------------+----------------
100 | 682.53091 | 0.388 | 0.35 | 0.35
200 | 469.906601 | 0.543 | 0.496 | 0.51
400 | 321.915349 | 0.78 | 0.721 | 0.752
800 | 201.620975 | 1.246 | 1.156 | 1.236
1600 | 94.438204 | 2.612 | 2.335 | 2.745
3200 | 38.292922 | 6.657 | 5.579 | 6.808
6400 | 21.48462 | 11.989 | 10.104 | 12.601

0001 + 0002 prepared
part_num | tps_ex | update_latency | select_latency | insert_latency
----------+------------+----------------+----------------+----------------
100 | 591.10863 | 0.433 | 0.342 | 0.422
200 | 393.223638 | 0.625 | 0.522 | 0.614
400 | 253.672736 | 0.946 | 0.828 | 0.928
800 | 146.840262 | 1.615 | 1.448 | 1.604
1600 | 52.805593 | 4.656 | 3.811 | 4.473
3200 | 21.461606 | 11.48 | 9.56 | 10.661
6400 | 11.888232 | 22.869 | 16.841 | 18.871

0001 + 0002 + 0003 no prepared
part_num | tps_ex | update_latency | select_latency | insert_latency
----------+------------+----------------+----------------+----------------
100 | 798.962029 | 0.304 | 0.267 | 0.339
200 | 577.893396 | 0.384 | 0.346 | 0.487
400 | 426.542177 | 0.472 | 0.435 | 0.717
800 | 288.616213 | 0.63 | 0.591 | 1.162
1600 | 154.247034 | 1.056 | 0.987 | 2.384
3200 | 59.711446 | 2.416 | 2.233 | 6.514
6400 | 37.109761 | 3.387 | 3.099 | 11.762

0001 + 0002 + 0003 prepared
part_num | tps_ex | update_latency | select_latency | insert_latency
----------+------------+----------------+----------------+----------------
100 | 662.414805 | 0.357 | 0.265 | 0.421
200 | 494.478431 | 0.439 | 0.349 | 0.579
400 | 307.982089 | 0.651 | 0.558 | 0.927
800 | 191.360676 | 0.979 | 0.876 | 1.548
1600 | 75.344947 | 2.253 | 2.003 | 4.301
3200 | 30.643902 | 5.716 | 4.955 | 10.118
6400 | 16.726056 | 12.512 | 8.582 | 18.054


Although it may not be related to this, when measured with pg11 beta2, somehow the performance was better.

11beta2 + v1-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch[3] prepared
part_num | tps_ex | update_latency | select_latency | insert_latency
----------+-------------+----------------+----------------+----------------
100 | 756.07228 | 0.942 | 0.091 | 0.123

[1] https://www.postgresql.org/message-id/CAKJS1f_QN-nmf6jCQ4gRU_8ab0zrd0ms-U%3D_Dj0KUARJiuGpOA%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAKJS1f9T_32Xpb-p8cWwo5ezSfVhXviUW8QTWncP8ksPHDRK8g%40mail.gmail.com
[3] https://www.postgresql.org/message-id/CAKJS1f_1RJyFquuCKRFHTdcXqoPX-PYqAd7nz%3DGVBwvGh4a6xA%40mail.gmail.com

regards,
Sho Kato
-----Original Message-----
From: Amit Langote [mailto:Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp]
Sent: Wednesday, August 29, 2018 9:06 PM
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: speeding up planning with partitions

It is more or less well known that the planner doesn't perform well with more than a few hundred partitions even when only a handful of partitions are ultimately included in the plan. Situation has improved a bit in PG
11 where we replaced the older method of pruning partitions one-by-one using constraint exclusion with a much faster method that finds relevant partitions by using partitioning metadata. However, we could only use it for SELECT queries, because UPDATE/DELETE are handled by a completely different code path, whose structure doesn't allow it to call the new pruning module's functionality. Actually, not being able to use the new pruning is not the only problem for UPDATE/DELETE, more on which further below.

While situation improved with new pruning where it could, there are still overheads in the way planner handles partitions. As things stand today, it will spend cycles and allocate memory for partitions even before pruning is performed, meaning most of that effort could be for partitions that were better left untouched. Currently, planner will lock, heap_open
*all* partitions, create range table entries and AppendRelInfos for them, and finally initialize RelOptInfos for them, even touching the disk file of each partition in the process, in an earlier phase of planning. All of that processing is vain for partitions that are pruned, because they won't be included in the final plan. This problem grows worse as the number of partitions grows beyond thousands, because range table grows too big.

That could be fixed by delaying all that per-partition activity to a point where pruning has already been performed, so that we know the partitions to open and create planning data structures for, such as somewhere downstream to query_planner. But before we can do that we must do something about the fact that UPDATE/DELETE won't be able to cope with that because the code path that currently handles the planning of UPDATE/DELETE on partitioned tables (inheritance_planner called from
subquery_planner) relies on AppendRelInfos for all partitions having been initialized by an earlier planning phase. Delaying it to query_planner would be too late, because inheritance_planner calls query_planner for each partition, not for the parent. That is, if query_planner, which is downstream to inheritance_planner, was in the charge of determining which partitions to open, the latter wouldn't know which partitions to call the former for. :)

That would be fixed if there is no longer this ordering dependency, which is what I propose to do with the attached patch 0001. I've tried to describe how the patch manages to do that in its commit message, but I'll summarize here. As things stand today, inheritance_planner modifies the query for each leaf partition to make the partition act as the query's result relation instead of the original partitioned table and calls grouping_planner on the query. That means anything that's joined to partitioned table looks to instead be joined to the partition and join paths are generated likewise. Also, the resulting path's targetlist is adjusted to be suitable for the result partition. Upon studying how this works, I concluded that the same result can be achieved if we call grouping_planner only once and repeat the portions of query_planner's and grouping_planner's processing that generate the join paths and appropriate target list, respectively, for each partition. That way, we can rely on query_planner determining result partitions for us, which in turn relies on the faster partprune.c based method of pruning. That speeds things up in two ways. Faster pruning and we no longer repeat common processing for each partition.

With 0001 in place, there is nothing that requires that partitions be opened by an earlier planning phase, so, I propose patch 0002, which refactors the opening and creation of planner data structures for partitions such that it is now performed after pruning. However, it doesn't do anything about the fact that partitions are all still locked in the earlier phase.

With various overheads gone thanks to 0001 and 0002, locking of all partitions via find_all_inheritos can be seen as the single largest bottleneck, which 0003 tries to address. I've kept it a separate patch, because I'll need to think a bit more to say that it's actually to safe to defer locking to late planning, due mainly to the concern about the change in the order of locking from the current method. I'm attaching it here, because I also want to show the performance improvement we can expect with it.

I measured the gain in performance due to each patch on a modest virtual machine. Details of the measurement and results follow.

* Benchmark scripts

update.sql
update ht set a = 0 where b = 1;

select.sql
select * from ht where b = 1;

* Table:

create table ht (a int, b int) partition by hash (b) create table ht_1 partition of ht for values with (modulus N, remainder 0) ..
create table ht_N partition of ht for values with (modulus N, remainder N-1)

* Rounded tps with update.sql and select.sql against regular table (nparts = 0) and partitioned table with various partition counts:

pgbench -n -T 60 -f update.sql

nparts master 0001 0002 0003
====== ====== ==== ==== ====
0 2856 2893 2862 2816
8 507 1115 1447 1872
16 260 765 1173 1892
32 119 483 922 1884
64 59 282 615 1881
128 29 153 378 1835
256 14 79 210 1803
512 5 40 113 1728
1024 2 17 57 1616
2048 0* 9 30 1471
4096 0+ 4 15 1236
8192 0= 2 7 975

* 0.46
+ 0.0064
= 0 (OOM on a virtual machine with 4GB RAM)

As can be seen here, 0001 is a big help for update queries.

pgbench -n -T 60 -f select.sql

For a select query that doesn't contain join and needs to scan only one
partition:

nparts master 0001 0002 0003
====== ====== ==== ==== ====
0 2290 2329 2319 2268
8 1058 1077 1414 1788
16 711 729 1124 1789
32 450 475 879 1773
64 265 272 603 1765
128 146 149 371 1685
256 76 77 214 1678
512 39 39 112 1636
1024 16 17 59 1525
2048 8 9 29 1416
4096 4 4 15 1195
8192 2 2 7 932

Actually, here we get almost same numbers with 0001 as with master, because 0001 changes nothing for SELECT queries. We start seeing improvement with 0002, the patch to delay opening partitions.

Thanks,
Amit

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Korotkov 2018-08-31 10:50:32 Re: Startup cost of sequential scan
Previous Message Maksim Milyutin 2018-08-31 10:41:35 Re: Hint to set owner for tablespace directory