Re: Speeding up INSERTs and UPDATEs to partitioned tables

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Speeding up INSERTs and UPDATEs to partitioned tables
Date: 2018-07-18 08:29:19
Message-ID: 64b668ca-8b35-2355-f937-ac058ef85690@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi David,

Thanks for taking a look.

On 2018/07/15 17:34, David Rowley wrote:
> I've looked over the code and the ExecUseUpdateResultRelForRouting()
> function is broken. Your while loop only skips partitions for the
> current partitioned table, it does not skip ModifyTable subnodes that
> belong to other partitioned tables.
>
> You can use the following. The code does not find the t1_a2 subnode.
>
> create table t1 (a int, b int) partition by list(a);
> create table t1_a1 partition of t1 for values in(1) partition by list(b);
> create table t1_a2 partition of t1 for values in(2);
> create table t1_a1_b1 partition of t1_a1 for values in(1);
> create table t1_a1_b2 partition of t1_a1 for values in(2);
> insert into t1 values(2,2);
>
> update t1 set a = a;

Hmm, it indeed is broken.

> I think there might not be enough information to make this work
> correctly, as if you change the loop to skip subnodes, then it won't
> work in cases where the partition[0] was pruned.
>
> I've another patch sitting here, partly done, that changes
> pg_class.relispartition into pg_class.relpartitionparent. If we had
> that then we could code your loop to work correctly.> Alternatively,
> I guess we could just ignore the UPDATE's ResultRelInfos and just
> build new ones. Unsure if there's actually a reason we need to reuse
> the existing ones, is there?

We try to use the existing ones because we thought back when the patch was
written (not by me though) that redoing all the work that
InitResultRelInfo does for each partition, for which we could have instead
used an existing one, would cumulatively end up being more expensive than
figuring out which ones we could reuse by a linear scan of partition and
result rels arrays in parallel. I don't remember seeing a benchmark to
demonstrate the benefit of doing this though. Maybe it was posted, but I
don't remember having looked at it closely.

> I think you'd need to know the owning partition and skip subnodes that
> don't belong to pd->reldesc. Alternatively, a hashtable could be built
> with all the oids belonging to pd->reldesc, then we could loop over
> the update_rris finding subnodes that can be found in the hashtable.
> Likely this will be much slower than the sort of merge lookup that the
> previous code did.

I think one option is to simply give up on the idea of matching *all*
UPDATE result rels that belong to a given partitioned table (pd->reldesc)
in one call of ExecUseUpdateResultRelForRouting. Instead, pass the index
of the partition (in pd->partdesc->oids) to find the ResultRelInfo for,
loop over all UPDATE result rels looking for one, and return immediately
on finding one after having stored its pointer in proute->partitions. In
the worst case, we'll end up scanning UPDATE result rels array for every
partition that gets touched, but maybe such an UPDATE query is less common
or even if such a query occurs, tuple routing might be the last of its
bottlenecks.

I have implemented that approach in the updated patch.

That means I also needed to change things so that
ExecUseUpdateResultRelsForRouting is now only called by ExecFindPartition,
because with the new arrangement, it's useless to call it from
ExecSetupPartitionTupleRouting. Moreover, an UPDATE may not use tuple
routing at all, even if the fact that partition key is being updated
results in calling ExecSetupPartitionTupleRouting.

> Another thing that I don't like is the PARTITION_ROUTING_MAXSIZE code.
> The code seems to assume that there can only be at the most 65536
> partitions, but I don't think there's any code which restricts us to
> that. There is code in the planner that will bork when trying to
> create a RangeTblEntry up that high, but as far as I know that won't
> be noticed on the INSERT path. I don't think this code has any
> business knowing what the special varnos are set to either. It would
> be better to just remove the limit and suffer the small wasted array
> space. I understand you've probably coded it like this due to the
> similar code that was in my patch, but with mine I knew the total
> number of partitions. Your patch does not.

OK, I changed it to UINT_MAX.

> Other thoughts on the patch:
>
> I wonder if it's worth having syscache keep a count on the number of
> sub-partitioned tables a partition has. If there are none in the root
> partition then the partition_dispatch_info can be initialized with
> just 1 element to store the root details. Although, maybe it's not
> worth it to reduce the array size by 7 elements.

Hmm yes. Allocating space for 8 pointers when we really need 1 is not too
bad, if the alternative is to modify partcache.c.

> Also, I'm a bit confused why you change the comments in
> execPartition.h for PartitionTupleRouting to be inline again. I
> brought those out of line as I thought the complexity of the code
> warranted that. You're inlining them again goes against what all the
> other structs do in that file.

It was out-of-line to begin with but it started to become distracting when
updating the comments. But I agree about being consistent and hence I
have moved them back to where they were. I have significantly rewritten
those comments though to be clearer.

> Apart from that, I think the idea is promising. We'll just need to
> find a way to make ExecUseUpdateResultRelForRouting work correctly.

Let me know what you think of the code in the updated patch.

Thanks,
Amit

Attachment Content-Type Size
david-0001-delta-v2.patch text/plain 51.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro HORIGUCHI 2018-07-18 08:30:14 Re: [HACKERS] PATCH: Keep one postmaster monitoring pipe per process
Previous Message Heikki Linnakangas 2018-07-18 08:26:44 Re: cursors with prepared statements