Quick Links

Performance of the partitioning in the large scale

From:	"Kato, Sho" <kato-sho(at)jp(dot)fujitsu(dot)com>
To:	"pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Performance of the partitioning in the large scale
Date:	2018-09-27 02:14:50
Message-ID:	25C1C6B2E7BE044889E4FE8643A58BA963B1F96F@G01JPEXMBKW03
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

Thanks to your efforts the performance of the partitioned table has improved greatly.
Since I evaluated the performance by combining the performance improvement patches proposed in PG12, I share it.

The purpose of this evaluation is to organize performance issues of the table which is partitioned large number like over thousand and I want to bring the partitioned table performance close to the no partitioned table performance.

I used pgbench.
On the premise that tables don't have data and plan_cache_mode is auto.

- source:

master(38763d6778 Date: Thu Sep 20 15:52:39 2018 +1200) + v9 patch[1] + v5 patch[2] + v3 patch[3]

[1] Reduce partition tuple routing overheads
https://commitfest.postgresql.org/19/1690/

[2] Revise executor's handling of range table relations
https://commitfest.postgresql.org/19/1758/

[3] Speed up planning with partitions
https://commitfest.postgresql.org/19/1778/

- table definition:

When 6400 items of data is inserted to parent table, the each leaf partitions have the same number of items.
For example, the following DDL is the number of leaf partitions is a hundred.

create table test.accounts_history(id serial, aid int, delta int, mtime timestamp without time zone) partition by range(aid);

create table test.account_part_1 partition of test.accounts for values from (1) to (65);
create table test.account_part_2 partition of test.accounts for values from (65) to (129);
...
create table test.account_part_100 partition of test.accounts for values from (6337) to (6400);

create table test.ah_part_1 partition of test.accounts_history for values from (1) to (65);
create table test.ah_part_2 partition of test.accounts_history for values from (65) to (129);
...
create table test.ah_part_100 partition of test.accounts_history for values from (6337) to (6400);

- benchmark script:

I make SQL which only one leaf partition is targeted in each case of SELECT/UPDATE/DELETE/INSERT.

\set aid random(1, 6400)
\set delta random(-5000, 5000)

SELECT abalance FROM test.accounts WHERE aid = :aid;
or
INSERT INTO test.accounts_history (aid, delta, mtime) VALUES (:aid, :delta, CURRENT_TIMESTAMP);
or
UPDATE test.accounts SET abalance = abalance + :delta WHERE aid = :aid;
or
DELETE FROM test.accounts where aid = :aid;

- results:

1. simple mode results:

part_num is the number of partition and 0 means no partitioned case.
tps_ex is tps of excluding connections establishing.

Also, after pgbench, I evaluate Planning time and Execution Time using explain analyze.
plan_time_avg and execute_time_avg are average of explain analyze when executed ten times.

pgbench -n -T 60 -r -f select.sql

part_num | tps_ex | plan_time_avg | execute_time_avg
----------+-------------+---------------+------------------
0 | 8285.83582 | 0.0528 | 0.0222
100 | 5948.1711 | 0.1342 | 0.0306
200 | 5436.438478 | 0.15 | 0.0298
400 | 4523.867744 | 0.148 | 0.03
800 | 3460.625739 | 0.1447 | 0.0305
1600 | 2428.795542 | 0.1528 | 0.0303
3200 | 1539.672214 | 0.1552 | 0.0316
6400 | 880.965232 | 0.1704 | 0.0288
(8 rows)

pgbench -n -T 60 -r -f update.sql

part_num | tps_ex | plan_time_avg | execute_time_avg
----------+-------------+---------------+------------------
0 | 7360.58261 | 0.0596 | 0.0417
100 | 4633.880563 | 0.1564 | 0.105
200 | 3972.737702 | 0.152 | 0.1007
400 | 3000.23471 | 0.1594 | 0.1039
800 | 2139.676379 | 0.1664 | 0.1055
1600 | 1348.553673 | 0.165 | 0.1056
3200 | 787.48559 | 0.1774 | 0.1124
6400 | 411.575671 | 0.1823 | 0.1089
(8 rows)

pgbench -n -T 60 -r -f delete.sql

part_num | tps_ex | plan_time_avg | execute_time_avg
----------+-------------+---------------+------------------
0 | 8133.84019 | 0.057 | 0.0403
100 | 5150.452458 | 0.1398 | 0.0936
200 | 4352.69018 | 0.1414 | 0.0964
400 | 3298.86364 | 0.1459 | 0.099
800 | 2245.946178 | 0.1559 | 0.1029
1600 | 1386.92366 | 0.1591 | 0.1048
3200 | 802.024765 | 0.1617 | 0.1042
6400 | 407.214158 | 0.168 | 0.1087
(8 rows)

pgbench -n -T 60 -r -f insert.sql

part_num | tps_ex | plan_time_avg | execute_time_avg
----------+-------------+---------------+------------------
0 | 5246.142416 | 0.0307 | 0.0601
100 | 2190.331571 | 0.0311 | 0.3587
200 | 1452.601752 | 0.0301 | 0.5065
400 | 863.771879 | 0.031 | 0.7864
800 | 482.528223 | 0.0308 | 1.361
1600 | 254.026173 | 0.0325 | 2.4999
3200 | 133.04952 | 0.0327 | 4.94
6400 | 68.541207 | 0.0335 | 9.8271
(8 rows)

SELECT/UPDATE/DELETE case

Planning Time and Execution Time are almost same. but, tps decreases as the number of partitions increases. why?

INSERT case

Execution Time is longer as the number of partitions increase.
This is why find_all_inheritors locks all leaf partitions. Improvement idea was already proposed in [1] but, not discussion started yet.

2. prepared mode results:

part_num and tps_ex are same with simple mode results.
plan_time_avg and execute_time are the average of the 7th to 10th out of 10 times.

pgbench -n -M prepared -T 60 -r -f select.sql

part_num | tps_ex | plan_time_avg | execute_time_avg
----------+-------------+---------------+------------------
0 | 10939.35898 | 0.007 | 0.021
100 | 4705.814419 | 0.063 | 0.025
200 | 3884.922764 | 0.067 | 0.0255
400 | 2762.569518 | 0.06575 | 0.0255
800 | 1826.824659 | 0.07 | 0.02625
1600 | 1119.961247 | 0.065 | 0.021
3200 | 588.001498 | 0.08725 | 0.02775
6400 | 218.817738 | 0.1325 | 0.03325
(8 rows)

pgbench -n -M prepared -T 60 -r -f update.sql

part_num | tps_ex | plan_time_avg | execute_time_avg
----------+--------------+---------------+------------------
0 | 10996.995001 | 0.01475 | 0.029
100 | 3711.790226 | 0.07975 | 0.074
200 | 2724.792289 | 0.0865 | 0.1055
400 | 1837.348771 | 0.09175 | 0.104
800 | 1105.378276 | 0.0925 | 0.10425
1600 | 619.924085 | 0.09425 | 0.1115
3200 | 289.202665 | 0.109 | 0.12475
6400 | 110.149098 | 0.1695 | 0.1585
(8 rows)

pgbench -n -M prepared -T 60 -r -f delete.sql

part_num | tps_ex | plan_time_avg | execute_time_avg
----------+--------------+---------------+------------------
0 | 11398.458657 | 0.009 | 0.02625
100 | 3997.105459 | 0.072 | 0.08675
200 | 3124.704202 | 0.0745 | 0.09725
400 | 2113.749027 | 0.083 | 0.10125
800 | 1365.142607 | 0.081 | 0.107
1600 | 740.248148 | 0.0895 | 0.10925
3200 | 363.054579 | 0.09525 | 0.11375
6400 | 124.795721 | 0.14725 | 0.158
(8 rows)

pgbench -n -M prepared -T 60 -r -f insert.sql

part_num | tps_ex | plan_time_avg | execute_time_avg
----------+-------------+---------------+------------------
0 | 6308.66377 | 0.01175 | 0.04225
100 | 2476.919703 | 0.014 | 0.26375
200 | 1569.59956 | 0.01325 | 0.38625
400 | 912.4026 | 0.014 | 0.6055
800 | 502.504371 | 0.0145 | 1.05725
1600 | 259.97686 | 0.01425 | 1.9835
3200 | 136.809811 | 0.01225 | 3.877
6400 | 70.489791 | 0.0105 | 7.76575
(8 rows)

SELECT/UPDATE/DELETE case

As with simple mode results, tps decreases as the number of partitions increases. why?
Also, I wonder why no prepared case is faster than prepared case.

When I see all 10 times results of the explain analyze, I realize Planning Time of 6th explain analyze is much longer.
why does it take so long to build a new generic plan?

part_num | num | plan_time | execute_time
----------+-----+-----------+--------------
6400 | 1 | 0.459 | 0.043
6400 | 2 | 0.168 | 0.028
6400 | 3 | 0.169 | 0.028
6400 | 4 | 0.151 | 0.027
6400 | 5 | 0.148 | 0.027
6400 | 6 | 449.213 | 0.052
6400 | 7 | 0.142 | 0.033
6400 | 8 | 0.12 | 0.034
6400 | 9 | 0.136 | 0.033
6400 | 10 | 0.132 | 0.033

INSERT case

I think the cause of performance degradation is the same as simple mode.

In the case of prepared statement, even if the target of SELECT statement is one leaf partition, since LockRelationOid of AcquireExecutorLocks is executed for all leaves, the performance decreases as the number of partitions increases.
Can AcquireExecutorLocks lock only the target leaf partition?

I am planning to investigate using a system TAP etc. for other bottlenecks.
If you have any other convenient method, please let me know.
Also, if there is something already known as a bottleneck, please let me know.

regards,

--
Sho Kato

Responses

Re: Performance of the partitioning in the large scale at 2018-09-27 21:02:56 from David Rowley

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tom Lane	2018-09-27 03:21:35	Re: Performance improvements for src/port/snprintf.c
Previous Message	Andres Freund	2018-09-27 02:05:20	Re: Performance improvements for src/port/snprintf.c