Speeding up INSERTs and UPDATEs to partitioned tables

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Speeding up INSERTs and UPDATEs to partitioned tables
Date: 2018-06-22 06:28:08
Message-ID: CAKJS1f_1RJyFquuCKRFHTdcXqoPX-PYqAd7nz=GVBwvGh4a6xA@mail.gmail.com
Lists: pgsql-hackers

Hi,

As part of my efforts to make partitioning scale better to larger
numbers of partitions, I've been looking primarily at INSERT ... VALUES
performance. Here the overheads are almost entirely in the executor;
planning for this type of statement is very simple since there is no
FROM clause to process.

My benchmarks have been around a RANGE partitioned table with 10k leaf
partitions and no sub-partitioned tables. The partition key is a
timestamp column.

I've found that ExecSetupPartitionTupleRouting() is very slow, for a
number of reasons. The biggest culprit is the locking of each partition
inside find_all_inheritors(). For now, this needs to remain, as we must
hold locks on each partition while performing
RelationBuildPartitionDesc(); otherwise, one of the partitions may get
dropped out from under us. There might be other valid reasons too, but
please see my note at the bottom of this email.

The locking is not the only problem. I found the following to be slow as well:

1. RelationGetPartitionDispatchInfo uses a List, so lappend() must
perform a palloc() each time a partition is added to the list.
2. A foreach loop is performed over leaf_parts to search for subplans
belonging to each partition. This is pointless for INSERTs, as there
are never any to find.
3. ExecCleanupTupleRouting() loops through the entire partitions
array. If a single tuple was inserted, all but one of the elements
will be NULL.
4. The tuple conversion map array is allocated up front on the
assumption there might be something to put into it. This is costly
when the array is large and pointless when there are no maps to store.
5. During get_partition_dispatch_recurse(), get_rel_relkind() is
called to determine whether the partition is a partitioned table or a
leaf partition. This results in a slow relcache hashtable lookup.
6. get_partition_dispatch_recurse() also ends up just building the
indexes array as a sequence of numbers from 0 to nparts - 1 when there
are no sub-partitioned tables. Doing this is slow when there are many
partitions.
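As a rough standalone illustration of items 1 and 6 (not PostgreSQL's
actual code; all names here are invented), the difference between a
palloc-per-cell List and a single array allocation looks like this:

```c
#include <assert.h>
#include <stdlib.h>

/* Invented stand-in for a List built with lappend(): every append
 * allocates a fresh cell, so n partitions cost n allocations. */
typedef struct Cell { int value; struct Cell *next; } Cell;

static Cell *cell_append(Cell *head, int value)
{
    Cell *c = malloc(sizeof(Cell));     /* one allocation per element */
    c->value = value;
    c->next = head;
    return c;
}

/* Array version: one allocation covers all nparts entries, and filling
 * it with 0..nparts-1 is a tight loop rather than n list appends. */
static int *build_index_array(int nparts)
{
    int *indexes = malloc(sizeof(int) * (size_t) nparts);
    for (int i = 0; i < nparts; i++)
        indexes[i] = i;
    return indexes;
}
```

With 10k partitions the List route pays 10k allocations where the array
route pays one.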

Besides the locking, the only thing that remains slow now is the
palloc0() for the 'partitions' array. In my test, it takes 0.6% of
execution time. I don't see any pretty ways to fix that.

I've written fixes for items 1-6 above.

I did:

1. Use an array instead of a List.
2. Don't do this loop. palloc0() the partitions array instead. Let
UPDATE add whatever subplans exist to the zeroed array.
3. Track what we initialize in a gapless array and clean up just those
entries. Make this array small initially and grow it only when we need
more space.
4. Only allocate the map array when we need to store a map.
5. Work that out in relcache beforehand.
6. ditto
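The idea behind fix 3 can be sketched as follows (a minimal standalone
sketch with invented names, not the patch itself): record each
initialized partition index in a small, densely packed array that grows
on demand, and have cleanup walk only that array rather than all nparts
slots.

```c
#include <assert.h>
#include <stdlib.h>

/* Gapless tracking array: starts small, doubles when full. */
typedef struct Tracker {
    int *touched;       /* partition indexes initialized so far */
    int  ntouched;
    int  capacity;
} Tracker;

static void tracker_init(Tracker *t)
{
    t->capacity = 8;    /* start small; grow only if needed */
    t->ntouched = 0;
    t->touched = malloc(sizeof(int) * (size_t) t->capacity);
}

static void tracker_record(Tracker *t, int partidx)
{
    if (t->ntouched == t->capacity)
    {
        t->capacity *= 2;
        t->touched = realloc(t->touched,
                             sizeof(int) * (size_t) t->capacity);
    }
    t->touched[t->ntouched++] = partidx;
}

/* Cleanup visits only the initialized entries, not all nparts slots. */
static int tracker_cleanup(const Tracker *t)
{
    int visited = 0;
    for (int i = 0; i < t->ntouched; i++)
        visited++;      /* stand-in for per-partition cleanup work */
    return visited;
}
```

For a single-row INSERT into a 10k-partition table, cleanup then
touches one entry instead of scanning 10k array slots.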

The only questionable thing I see is what I did for 6. In partcache.c
I'm basically building an array of nparts elements containing 0 to
nparts - 1. It seems a bit pointless, so perhaps there's a better way.
I was also a bit too tight to memcpy() that out of relcache, and just
pointed directly at it. That might be a no-go area.
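One possible alternative here (a sketch only; partition_index is an
invented name, and this is not what the patch does): treat a NULL
indexes array as the identity mapping, so the 0..nparts - 1 array never
needs to be built or copied out of relcache at all.

```c
#include <assert.h>
#include <stddef.h>

/* If there are no sub-partitioned tables, indexes[i] == i for every i,
 * so a NULL pointer can stand in for the whole array. */
static inline int partition_index(const int *indexes, int i)
{
    return indexes ? indexes[i] : i;    /* NULL means identity */
}
```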

I've attached 2 patches:

0001: Implements items 1-6.
0002: Not intended for commit; it's just a demo of the performance we
could get if we were smarter about locking partitions. I've included it
only to show what 0001 is worth.

Performance

AWS: m5d.large fsync=off

Unpatched:

$ pgbench -n -T 60 -f partbench_insert.sql postgres
transaction type: partbench_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 2836
latency average = 21.162 ms
tps = 47.254409 (including connections establishing)
tps = 47.255756 (excluding connections establishing)

(yes, it's bad)

0001:

$ pgbench -n -T 60 -f partbench_insert.sql postgres
transaction type: partbench_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 3235
latency average = 18.548 ms
tps = 53.913121 (including connections establishing)
tps = 53.914629 (excluding connections establishing)

(a small improvement from 0001)

0001+0002:

$ pgbench -n -T 60 -f partbench_insert.sql postgres
transaction type: partbench_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 660079
latency average = 0.091 ms
tps = 11001.303764 (including connections establishing)
tps = 11001.602377 (excluding connections establishing)

(something to aspire towards)

0002 (only):

$ pgbench -n -T 60 -f partbench_insert.sql postgres
transaction type: partbench_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 27682
latency average = 2.168 ms
tps = 461.350885 (including connections establishing)
tps = 461.363327 (excluding connections establishing)

(shows that doing 0002 alone does not fix all our problems)

Unpartitioned table (control test):

$ pgbench -n -T 60 -f partbench__insert.sql postgres
transaction type: partbench__insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 801260
latency average = 0.075 ms
tps = 13354.311397 (including connections establishing)
tps = 13354.656163 (excluding connections establishing)

Test setup:

CREATE TABLE partbench_ (date TIMESTAMP NOT NULL, i1 INT NOT NULL, i2
INT NOT NULL, i3 INT NOT NULL, i4 INT NOT NULL, i5 INT NOT NULL);
CREATE TABLE partbench (date TIMESTAMP NOT NULL, i1 INT NOT NULL, i2
INT NOT NULL, i3 INT NOT NULL, i4 INT NOT NULL, i5 INT NOT NULL)
PARTITION BY RANGE (date);
\o /dev/null
select 'CREATE TABLE partbench' || x::text || ' PARTITION OF partbench
FOR VALUES FROM (''' || '2017-03-06'::date + (x::text || '
hours')::interval || ''') TO (''' || '2017-03-06'::date + ((x+1)::text
|| ' hours')::interval || ''');'
from generate_Series(0,9999) x;
\gexec
\o

partbench_insert.sql contains:
insert into partbench values('2018-04-26 15:00:00',1,2,3,4,5);

partbench__insert.sql contains:
insert into partbench_ values('2018-04-26 15:00:00',1,2,3,4,5);

I don't want to discuss the locking on this thread, as that would
detract from what I'm proposing here, which is not to change anything
relating to locks. I'm still working on that and will post about it
elsewhere. Please start another thread if you'd like to discuss it in
the meantime, and feel free to link it here so others can follow.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
v1-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch application/octet-stream 30.5 KB
v1-0002-Unsafe-locking-reduction-for-partitioned-INSERT-U.patch application/octet-stream 2.5 KB
