Re: Speeding up INSERTs and UPDATEs to partitioned tables

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Speeding up INSERTs and UPDATEs to partitioned tables
Date: 2018-07-27 07:11:11
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2018/07/27 1:19, David Rowley wrote:
> On 18 July 2018 at 20:29, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> Let me know what you think of the code in the updated patch.
> Thanks for sending the updated patch.
> I looked over it tonight and made a number of changes:
> 1) Got rid of PARTITION_ROUTING_MAXSIZE. The code using this was
> useless since the int would have wrapped long before it reached

Oops, you're right.

> There's no shortage of other code doubling the size of an
> array by multiplying it by 2 unconditionally without considering
> overflowing an int. Unsure why you considered this more risky.

Just ill-informed paranoia on my part. Let's just drop it as you say,
given also the Tom's comment that repalloc would fail anyway for requests
over 1GB.

> 2) Fixed a series of bugs regarding the size of the arrays in
> PartitionTupleRouting. The map arrays and the partitions array could
> differ in size despite your comment that claimed
> child_parent_tupconv_maps was the same size as 'partitions' when
> non-NULL. The map arrays being a different size than the partitions
> array caused the following two cases to segfault. I've included two
> cases as it was two seperate bugs that caused them.
> -- case 1

[ .... ]

> -- case 2

Indeed, there were some holes in the logic that led to me to come up with
that code.

> 3) Got rid of ExecUseUpdateResultRelForRouting. I started to change
> this to remove references to UPDATE in order to make it more friendly
> towards other possible future node types that it would get used for
> (aka MERGE). In the end, I found that performance could regress when
> in cases like:
> drop table listp;
> create table listp (a int) partition by list(a);
> \o /dev/null
> \timing off
> select 'create table listp'||x::Text||' partition of listp for values
> in('||x::Text||');' from generate_series(1,1000) x;
> \gexec
> \o
> insert into listp select x from generate_series(1,999) x;
> \timing on
> update listp set a = a+1;
> It's true that UPDATEs with a large number of subplans performance is
> quite terrible today in the planner, but this code made the
> performance of planning+execution a bit worse. If we get around to
> fixing the inheritance planner then I think
> ExecUseUpdateResultRelForRouting() could easily appear in profiles.
> I ended up rewriting it to just get called once and build a hash table
> by Oid storing a ResultRelInfo pointer. This also gets rid of the
> slow nested loop in the cleanup operation inside
> ExecCleanupTupleRouting().

OK, looks neat, although I'd name the hash table subplan_resultrel_hash
(like join_rel_hash in PlannerInfo), instead of subplan_partition_table.

> 4) Did some tuning work in ExecFindPartition() getting rid of a
> redundant check after the loop completion. Also added some likely()
> and unlikely() decorations around some conditions.

All changes seem good.

> 5) Updated some newly out-dated comments since your patch in execPartition.h.
> 6) Replaced the palloc0() in ExecSetupPartitionTupleRouting() with a
> palloc() updating the few fields that were not initialised. This might
> save a few TPS (at least once we get rid of the all partition locking)
> in the single-row INSERT case, but I've not tested the performance of
> this yet.
> 7) Also moved and edited some comments above
> ExecSetupPartitionTupleRouting() that I felt explained a little too
> much about some internal implementation details.

Thanks, changes look good.

> One thing that I thought of, but didn't do was just having
> ExecFindPartition() return the ResultRelInfo. I think it would be much
> nicer in both call sites to not have to check the ->partitions array
> to get that. The copy.c call site would need a few modifications
> around the detection code to see if the partition has changed, but it
> all looks quite possible to change. I left this for now as I have
> another patch which touches all that code that I feel is closer to
> commit than this patch is.

I had wondered about that too, but gave up on doing anything about it
because the callers of ExecFindPartition want to access other fields of
PartitionTupleRouting using the returned index. Maybe, we could make it
return a ResultRelInfo * and also return the index itself using a separate
output argument. Seems like a cosmetic improvement that can be made later.

> I've attached a delta of the changes I made since your v2 delta and
> also a complete updated patch.

Thanks. Here are some other minor comments on the complete v2 patch.

- tuple =
+ tuple =
ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps ?
proute->parent_child_tupconv_maps[leaf_part_index] :

This piece of code that's present in both ExecPrepareTupleRouting and
CopyFrom can be written as:

if (proute->parent_child_tupconv_maps)

+ /*
+ * If UPDATE needs to do tuple routing, we'll need a slot that will
+ * transiently store the tuple being routed using the root parent's
+ * rowtype. We must set up at least this slot, because it's needed even
+ * before tuple routing begins. Other necessary information is
+ * initialized when tuple routing code calls
+ * ExecUseUpdateResultRelForRouting.
+ */
if (node && node->operation == CMD_UPDATE)

This comment needs to be updated, because you changed the if block's body as:

+ ExecHashSubPlanResultRelsByOid(mtstate, proute);
proute->root_tuple_slot = MakeTupleTableSlot(NULL);

So, we don't just set up the slot here, we also now set up the hash table
to store sub-plan result rels. Also, ExecUseUpdateResultRelForRouting no
longer exist.

+ /*
+ * Get the index for PartitionTupleRouting->partitions array
+ * for this leaf partition. This may require building a new
+ * ResultRelInfo.
+ */

1st sentence reads a bit strange. Did you mean to write the following?

* Get this leaf partition's index in the
* PartitionTupleRouting->partitions array.
* This may require building a new ResultRelInfo.

The following block of code could use a one-line comment describing what's
going on (although, what's going on might be pretty clear to some eyes
just by looking at the code):

if (proute->subplan_partition_table)
ResultRelInfo *rri;
Oid partoid = partdesc->oids[partidx];

rri = hash_search(proute->subplan_partition_table,
&partoid, HASH_FIND, NULL);

+ * ExecInitPartitionDispatchInfo
+ * Initialize PartitionDispatch for a partitioned table
+ *
+ * This also stores it in the proute->partition_dispatch_info array at the
+ * specified index ('dispatchidx'), possibly expanding the array if there
+ * isn't space left in it.
+ */

You renamed 'dispatchidx' to 'partidx' in the function's signature but
forgot to update this comment.

I've attached a delta patch to make the above changes. I'm leaving the
hash table rename up to you though.


Attachment Content-Type Size
v2-delta.patch text/plain 4.2 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Bapat 2018-07-27 07:17:04 Re: [HACKERS] advanced partition matching algorithm for partition-wise join
Previous Message Toshi Harada 2018-07-27 07:08:13 Re: "WIP: Data at rest encryption" patch and, 2 phase commit.