Re: Speeding up INSERTs and UPDATEs to partitioned tables

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Speeding up INSERTs and UPDATEs to partitioned tables
Date: 2018-07-13 08:20:47
Message-ID: 5209d0bf-869a-e099-5580-4f48b41f4c41@lab.ntt.co.jp
Lists: pgsql-hackers

Hi David.

On 2018/06/22 15:28, David Rowley wrote:
> Hi,
>
> As part of my efforts to make partitioning scale better for larger
> numbers of partitions, I've been looking at primarily INSERT VALUES
> performance. Here the overheads are almost completely in the
> executor. Planning of this type of statement is very simple since
> there is no FROM clause to process.

Thanks for this effort.

> My benchmarks have been around a RANGE partitioned table with 10k leaf
> partitions and no sub-partitioned tables. The partition key is a
> timestamp column.
>
> I've found that ExecSetupPartitionTupleRouting() is very slow indeed
> and there are a number of things slow about it. The biggest culprit
> for the slowness is the locking of each partition inside of
> find_all_inheritors().

Yes. :-(

> For now, this needs to remain as we must hold
> locks on each partition while performing RelationBuildPartitionDesc(),
> otherwise, one of the partitions may get dropped out from under us.

We lock all partitions using find_all_inheritors to keep the locking order
consistent with other sites that may want to lock tables in the same
partition tree, possibly with a conflicting lock mode.  If we remove the
find_all_inheritors call in ExecSetupPartitionTupleRouting (like your
0002 does), we may end up locking partitions in an arbitrary order within
a given transaction, because input tuples will be routed to the various
partitions in an order that's not predetermined.

But maybe it's not necessary to be that paranoid.  If we've locked the
parent, any concurrent lockers would have to wait for that lock anyway, so
it doesn't matter in which order tuple routing locks the partitions.

> The locking is not the only slow thing. I found the following to also be slow:
>
> 1. RelationGetPartitionDispatchInfo uses a List and lappend() must
> perform a palloc() each time a partition is added to the list.
> 2. A foreach loop is performed over leaf_parts to search for subplans
> belonging to this partition. This seems pointless to do for INSERTs as
> there's never any to find.
> 3. ExecCleanupTupleRouting() loops through the entire partitions
> array. If a single tuple was inserted then all but one of the elements
> will be NULL.
> 4. Tuple conversion map allocates an empty array thinking there might
> be something to put into it. This is costly when the array is large
> and pointless when there are no maps to store.
> 5. During get_partition_dispatch_recurse(), get_rel_relkind() is
> called to determine if the partition is a partitioned table or a leaf
> partition. This results in a slow relcache hashtable lookup.
> 6. get_partition_dispatch_recurse() also ends up just building the
> indexes array with a sequence of numbers from 0 to nparts - 1 if there
> are no sub-partitioned tables. Doing this is slow when there are many
> partitions.
>
> Besides the locking, the only thing that remains slow now is the
> palloc0() for the 'partitions' array. In my test, it takes 0.6% of
> execution time. I don't see any pretty ways to fix that.
>
> I've written fixes for items 1-6 above.
>
> I did:
>
> 1. Use an array instead of a List.
> 2. Don't do this loop. palloc0() the partitions array instead. Let
> UPDATE add whatever subplans exist to the zeroed array.
> 3. Track what we initialize in a gapless array and cleanup just those
> ones. Make this array small and increase it only when we need more
> space.
> 4. Only allocate the map array when we need to store a map.
> 5. Work that out in relcache beforehand.
> 6. ditto

The issues you list all seem legitimate to me, as do your proposed fixes
for each, except that I think we could go a bit further.

Why don't we abandon the notion altogether that
ExecSetupPartitionTupleRouting *has to* process the whole partition tree?
ISTM there is no need to determine the exact number of leaf partitions
and partitioned tables in the partition tree and allocate the arrays in
PartitionTupleRouting based on that.  I know that the indexes array in
PartitionDispatchData contains a mapping from local partition indexes (0 to
partdesc->nparts - 1) to indexes that span *all* leaf partitions and *all*
partitioned tables (0 to proute->num_partitions or proute->num_dispatch)
in a partition tree, but we can change that.
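
To make that mapping concrete, here is a minimal standalone sketch of what
gets built up front today.  The names (EagerDispatchSketch etc.) are made
up for illustration only and are not the actual PostgreSQL structures:

#include <stdio.h>
#include <stdlib.h>

/*
 * Toy sketch of the existing, eager mapping: each dispatch node's
 * indexes[] translates a partition's local position under its parent
 * (0 .. nparts - 1) into a position in tree-wide arrays covering *all*
 * leaf partitions and *all* partitioned tables, and those arrays are
 * sized and filled before any tuple is routed.
 */
typedef struct EagerDispatchSketch
{
	int		nparts;		/* direct partitions of this parent */
	int	   *indexes;	/* local index -> tree-wide index */
} EagerDispatchSketch;

int
main(void)
{
	EagerDispatchSketch pd;

	pd.nparts = 2500;
	pd.indexes = malloc(sizeof(int) * pd.nparts);
	for (int i = 0; i < pd.nparts; i++)
		pd.indexes[i] = i;	/* just 0, 1, 2, ... when there are no
							 * sub-partitioned tables (your item 6) */

	printf("built %d mappings; a single-row INSERT uses 1 of them\n",
		   pd.nparts);
	free(pd.indexes);
	return 0;
}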

The idea I had was inspired by looking at the partitions_init stuff in
your patch.  We could allocate the proute->partition_dispatch_info and
proute->partitions arrays with a predetermined size, which doesn't
require us to calculate the exact number of leaf partitions and
partitioned tables beforehand.  So, RelationGetPartitionDispatchInfo need
not recursively go over the whole partition tree.  Instead, we create just
one PartitionDispatch object, for the root parent table, whose indexes
array is initialized with -1, meaning that none of the partitions has been
encountered yet.  In ExecFindPartition, once tuple routing chooses a
partition, we create either a ResultRelInfo (if it is a leaf) or a
PartitionDispatch for it and store it in the next free slot of
proute->partitions or proute->partition_dispatch_info, respectively.
We also update the indexes array in the parent's PartitionDispatch,
replacing the -1 with that slot's index, so that future tuples routed to
the same partition don't allocate it again.  The process is repeated if
the tuple needs to be routed one more level down.  If the query needs more
ResultRelInfos and/or PartitionDispatch objects than we initially
allocated space for, we expand those arrays.  Finally, during
ExecCleanupTupleRouting, we only "clean up" the partitions for which we
allocated ResultRelInfos and PartitionDispatch objects, which is very
similar to the partitions_init idea in your patch.
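
Here is a toy, standalone sketch of that scheme.  None of the names below
are real PostgreSQL types or functions, and a real implementation of
course goes through the partition key and handles multiple levels, but
the -1 sentinel, the lazy setup, the on-demand array growth and the
shortened cleanup are the same ideas described above:

#include <stdio.h>
#include <stdlib.h>

/*
 * ToyDispatch/ToyRouting stand in for PartitionDispatch and
 * PartitionTupleRouting; route_tuple() stands in for the lookup done by
 * ExecFindPartition after the partition key has chosen a local index.
 */
typedef struct ToyDispatch
{
	int		nparts;		/* direct partitions of this parent */
	int	   *indexes;	/* local index -> slot in the global leaf
						 * array, or -1 if not yet initialized */
} ToyDispatch;

typedef struct ToyRouting
{
	int	  **leaf_slots;	/* stand-ins for per-leaf ResultRelInfos */
	int		num_leaf;	/* slots actually initialized so far */
	int		max_leaf;	/* currently allocated capacity */
} ToyRouting;

static ToyDispatch *
make_dispatch(int nparts)
{
	ToyDispatch *pd = malloc(sizeof(ToyDispatch));

	pd->nparts = nparts;
	pd->indexes = malloc(sizeof(int) * nparts);
	for (int i = 0; i < nparts; i++)
		pd->indexes[i] = -1;	/* nothing routed to this child yet */
	return pd;
}

static int
route_tuple(ToyRouting *routing, ToyDispatch *parent, int local_idx)
{
	if (parent->indexes[local_idx] == -1)
	{
		/* First tuple for this partition: grow the array if needed and
		 * set up its (stand-in) ResultRelInfo only now.  A sub-partitioned
		 * child would get a new ToyDispatch here instead, repeating the
		 * same pattern one level down. */
		if (routing->num_leaf == routing->max_leaf)
		{
			routing->max_leaf *= 2;
			routing->leaf_slots = realloc(routing->leaf_slots,
										  sizeof(int *) * routing->max_leaf);
		}
		routing->leaf_slots[routing->num_leaf] = malloc(sizeof(int));
		*routing->leaf_slots[routing->num_leaf] = local_idx;
		parent->indexes[local_idx] = routing->num_leaf++;
	}
	return parent->indexes[local_idx];
}

int
main(void)
{
	ToyRouting	routing;
	ToyDispatch *root = make_dispatch(2500);	/* 2500 children, none set up */

	routing.num_leaf = 0;
	routing.max_leaf = 8;		/* small initial guess, grown on demand */
	routing.leaf_slots = malloc(sizeof(int *) * routing.max_leaf);

	/* A single-row INSERT touches exactly one partition. */
	printf("tuple for partition 1723 -> slot %d\n",
		   route_tuple(&routing, root, 1723));
	printf("another tuple for 1723   -> slot %d\n",
		   route_tuple(&routing, root, 1723));

	/* "Cleanup" visits only the slots we initialized, not all 2500. */
	for (int i = 0; i < routing.num_leaf; i++)
		free(routing.leaf_slots[i]);
	free(routing.leaf_slots);
	free(root->indexes);
	free(root);
	return 0;
}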

I implemented that idea in the attached patch, which applies on top of
your 0001 patch, but I'd say it's too big to just be called a delta.  I
was able to get the following performance numbers using this pgbench
test:

pgbench -n -T 180 -f insert-ht.sql
cat insert-ht.sql
\set b random(1, 1000)
\set a random(1, 1000)
insert into ht values (:b, :a);

Note that pgbench was run 3 times for each case and every tps result is listed below.

HEAD - 0 parts (unpartitioned table)
tps = 2519.603076 (including connections establishing)
tps = 2486.903189 (including connections establishing)
tps = 2518.771182 (including connections establishing)

HEAD - 2500 hash parts (no subpart)
tps = 13.158224 (including connections establishing)
tps = 12.940713 (including connections establishing)
tps = 12.882465 (including connections establishing)

David - 2500 hash parts (no subpart)
tps = 18.717628 (including connections establishing)
tps = 18.602280 (including connections establishing)
tps = 18.945527 (including connections establishing)

Amit - 2500 hash parts (no subpart)
tps = 18.576858 (including connections establishing)
tps = 18.431902 (including connections establishing)
tps = 18.797023 (including connections establishing)

HEAD - 2500 hash parts (4 hash subparts each)
tps = 2.339332 (including connections establishing)
tps = 2.339582 (including connections establishing)
tps = 2.317037 (including connections establishing)

David - 2500 hash parts (4 hash subparts each)
tps = 3.225044 (including connections establishing)
tps = 3.214053 (including connections establishing)
tps = 3.239591 (including connections establishing)

Amit - 2500 hash parts (4 hash subparts each)
tps = 3.321918 (including connections establishing)
tps = 3.305952 (including connections establishing)
tps = 3.301036 (including connections establishing)

Applying the lazy locking patch on top of David's and my patch,
respectively, produces the following results.

David - 2500 hash parts (no subpart)
tps = 1577.854360 (including connections establishing)
tps = 1532.681499 (including connections establishing)
tps = 1464.254096 (including connections establishing)

Amit - 2500 hash parts (no subpart)
tps = 1532.475751 (including connections establishing)
tps = 1534.650325 (including connections establishing)
tps = 1527.840837 (including connections establishing)

David - 2500 hash parts (4 hash subparts each)
tps = 78.845916 (including connections establishing)
tps = 79.167079 (including connections establishing)
tps = 79.621686 (including connections establishing)

Amit - 2500 hash parts (4 hash subparts each)
tps = 329.887008 (including connections establishing)
tps = 327.428103 (including connections establishing)
tps = 326.863248 (including connections establishing)

About the last two results: after getting rid of the time hog that is the
find_all_inheritors() call in ExecSetupPartitionTupleRouting for locking
all partitions, it seems that without my patch we'll end up spending most
of the time in RelationGetPartitionDispatchInfo(), because it will call
get_partition_dispatch_recurse() for each of the 2500 first-level
partitions that are themselves partitioned.  With my patch, we won't do
that and won't end up generating 2499 PartitionDispatch objects that are
not needed for a single-row insert statement.

Thanks,
Amit

Attachment: david-0001-delta.patch (text/plain, 49.4 KB)
