Re: Performance improvements of INSERTs to a partitioned table

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, "Kato, Sho" <kato-sho(at)jp(dot)fujitsu(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Performance improvements of INSERTs to a partitioned table
Date: 2018-11-09 05:45:18
Message-ID: 16db1458-67cf-4add-736e-31b053115e8e@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2018/11/09 11:19, David Rowley wrote:
> On 7 November 2018 at 21:31, Kato, Sho <kato-sho(at)jp(dot)fujitsu(dot)com> wrote:
>> AFAIK, When CREATE INDEX on a partition and INSERT to a parent table are
>> executed at the same time, this patch causes deadlock.
>>
>> * partitions information
>>
>> Partition key: RANGE (a)
>> Partitions: a_1 FOR VALUES FROM (1) TO (100),
>> a_2 FOR VALUES FROM (100) TO (200)
>>
>> T1: create index a_1_ix on a_1(a);
>> T2: insert into a values(101),(1); locking a_2 and waiting releasing a_1’s
>> lock
>> T1: create index a_2_ix on a_2(a); ERROR: deadlock detected
>>
>> I think this situation does not mean unsafe because similar situation will
>> occurs at no partitioned tables and DBMS could not prevent this situation.
>>
>> But, I'm not sure this behavior is correct.
>
> That's one case that will deadlock with the 0002 patch in [1]. Another
> command that takes a conflicting lock on the partition only is
> TRUNCATE. Many other commands that normally take an AEL can't be run
> on a partition, for example, ALTER TABLE ADD COLUMN.
>
> The same would have deadlocked on master if you'd created the indexes
> in the reverse order, but I guess the point is that today there's some
> defined order that you can create the indexes in that won't cause a
> deadlock, but with [1] there is no defined order since the tables are
> locked in order that tuples are routed to them.

We cannot control the order in which a user performs operations directly
on partitions. What we do have to worry about is the order in which
partitions are locked by a particular piece of backend code when a certain
operation is applied to the parent table. However, if the user had
applied the operation to the parent table instead, then it would've
blocked for any concurrent queries that already had a lock on the parent,
or vice versa, because obviously the table named in the command is locked
before any of its children.

> Robert pointed out something interesting in [2] about UPDATEs of a
> partition key perhaps routing a tuple to a partition that's been
> excluded by constraint exclusion, so the lock is not taken initially> but would be taken later in ExecSetupPartitionTupleRouting(). I ran
> through that scenario today and discovered it can't happen as excluded
> partitions are still in the rangetable, so are still locked either in
> the planner or by AcquireExecutorLocks() and the order those are
> locked in is well defined from the planner.

Yes, expand_inherited_rtentry() would lock *all* partitions even if some
of them might not end up in the final plan due to some being excluded.

> However, I believe that's
> something Amit plans to change in [3], where he proposes to only lock
> partitions that survive partition pruning (among many other things).

With my patch, while the planner would take the locks in deterministic
order on the partitions that it adds to the plan, the executor time
locking order is non-deterministic considering that tuple routing may
encounter multiple, not yet seen, partitions.

> There are probably a few things we could do here:
>
> a. Have all DDL on partitions obtain the same lock level on all
> ancestor partitioned tables taking a lock on the root first and
> working down to the leaf.
> b. Just lock what we touch and increase the likelihood of deadlocks occurring.
> c. Reject [1] and [3] because we don't want to increase the chances of
> deadlocks.
>
> I'm highly against doing 'c' since both [1] and [3] are very useful
> patches which scale partitioning to work well with very large numbers
> of partitions. At the moment we just can bearly do half a dozen
> partitions before the performance starts going south.

Agreed.

> 'a' is not that
> ideal a solution either as it means that anyone doing CREATE INDEX or
> TRUNCATE on a partition would conflict with queries that are querying
> an unrelated part of the partitioned table.

As you pointed out, many of the commands that require locks that would
conflict with concurrent queries are not allowed to run directly on
partitions anyway. For operations that *are* allowed, doing 'a' is same
as asking the user to perform the operation on the root parent instead, at
least as far as the concurrency is concerned (user may not want to
*truncate* all partitions, only some.)

> While I don't really like
> 'b' either, I wonder how many people are going to hit deadlocks here.
> To hit that, I believe that you'd have to be doing the TRUNCATE or
> CREATE INDEX inside a transaction and to hit it where you couldn't hit
> it before you'd have had to carefully written your CREATE INDEX /
> TRUNCATE statements to ensure they're executed in partition order. I'm
> unsure how likely that is, but I certainly can't rule out that doing
> 'b' won't upset anyone.

As long as queries involve tuple routing that touches multiple not yet
seen partitions, someone doing conflicting operations directly on multiple
partitions in a transaction will have to be ready to see deadlocks.
Maybe, we can document that.

> Perhaps a GUC is needed to choose between 'a' and 'b'?

I guess we'll have a hard time explaining such configuration option to
users though.

Thanks,
Amit

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2018-11-09 05:56:34 Re: csv format for psql
Previous Message Kyotaro HORIGUCHI 2018-11-09 05:39:31 Re: BUG #15449: file_fdw using program cause exit code error when using LIMIT