Re: UPDATE of partition key

From: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
To: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: UPDATE of partition key
Date: 2017-08-02 10:49:35
Message-ID: CAJ3gD9fZ=uXRpn3_U8=jGJwf0bDxZyCKVULzD4VuN6ACzT+R4w@mail.gmail.com
Lists: pgsql-hackers

On 2 August 2017 at 14:38, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> On 2017/07/29 2:45, Amit Khandekar wrote:
>> On 28 July 2017 at 20:10, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> On Wed, Jul 26, 2017 at 2:13 AM, Amit Langote wrote:
>>>> I checked that we get the same result relation order with both the
>>>> patches, but I would like to highlight a notable difference here between
>>>> the approaches taken by our patches. In my patch, I have now taught
>>>> RelationGetPartitionDispatchInfo() to lock *only* the partitioned tables
>>>> in the tree, because we need to look at its partition descriptor to
>>>> collect partition OIDs and bounds. We can defer locking (and opening the
>>>> relation descriptor of) leaf partitions to a point where planner has
>>>> determined that the partition will be accessed after all (not pruned),
>>>> which will be done in a separate patch of course.
>>>
>>> That's very desirable, but I believe it introduces a deadlock risk
>>> which Amit's patch avoids. A transaction using the code you've
>>> written here is eventually going to lock all partitions, BUT it's
>>> going to move the partitioned ones to the front of the locking order
>>> vs. what find_all_inheritors would do. So, when multi-level
>>> partitioning is in use, I think it could happen that some other
>>> transaction is accessing the table using a different code path that
>>> uses the find_all_inheritors order without modification. If those
>>> locks conflict (e.g. query vs. DROP) then there's a deadlock risk.
>>
>> Yes, I agree. Even with single-level partitioning, find_all_inheritors()
>> orders the leaf partitions by OID, so that's also going to be ordered
>> differently.
>
> We do require locking the parent first in any case. Doesn't that prevent
> deadlocks by imposing an implicit order on locking for operations whose
> locks conflict?

Maybe, but I am not too sure at this point. find_all_inheritors() locks
only the children; the parent is locked separately, before the call.
Moreover, find_all_inheritors() does not require locking the children
with the same lockmode as the parent.
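
To illustrate what I mean, here is a minimal sketch (mine, not code from
either patch; the caller function name is made up):

#include "postgres.h"
#include "access/heapam.h"
#include "catalog/pg_inherits_fn.h"
#include "nodes/pg_list.h"
#include "storage/lockdefs.h"

/*
 * Illustrative only -- lock_parent_then_children() is a made-up caller,
 * not anything in the tree.  The parent is locked by the caller itself;
 * find_all_inheritors() then locks only the descendants, with whatever
 * lockmode we pass, which need not match the parent's.
 */
static List *
lock_parent_then_children(Oid parentOid)
{
    Relation    parent;
    List       *all_oids;

    /* The caller locks the parent itself ... */
    parent = heap_open(parentOid, AccessShareLock);

    /*
     * ... while find_all_inheritors() locks only the children.  (The
     * returned list begins with parentOid, but that entry is not
     * locked here.)  The children's lockmode is deliberately different
     * from the parent's, to show that nothing forces them to match.
     */
    all_oids = find_all_inheritors(parentOid, RowExclusiveLock, NULL);

    heap_close(parent, NoLock); /* keep the parent's lock */

    return all_oids;
}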

> Having said that, I think it would be desirable for all code paths to
> manipulate partitions in the same order. For partitioned tables, I think
> we can make it the partition bound order by replacing all calls to
> find_all_inheritors and find_inheritance_children on partitioned table
> parents with something else that reads partition OIDs from the relcache
> (PartitionDesc) and traverses the partition tree in a breadth-first manner.
>
>>> Unfortunately I don't see any easy way around that problem, but maybe
>>> somebody else has an idea.
>>
>> One approach I had considered was to have find_inheritance_children()
>> itself lock the children in bound order, so that everyone will have
>> bound-ordered oids, but that would be too expensive since it requires
>> opening all partitioned tables to initialize partition descriptors. In
>> find_inheritance_children(), we get all oids without opening any
>> tables. But now that I think more about it, it's only the partitioned
>> tables that we have to open, not the leaf partitions; and furthermore,
>> I didn't see calls to find_inheritance_children() and
>> find_all_inheritors() in performance-critical code, except in
>> expand_inherited_rtentry(). All of them are in DDL commands; but yes,
>> that can change in the future.
>
> This approach more or less amounts to calling the new
> RelationGetPartitionDispatchInfo() (per my proposed patch, a version of
> which I posted upthread). Maybe we can add a wrapper on top, say,
> get_all_partition_oids(), which throws away the other things that
> RelationGetPartitionDispatchInfo() returns. In addition, it would lock
> all the partitions that are returned, not just the partitioned tables,
> which are all that RelationGetPartitionDispatchInfo() has been taught
> to lock.

So there are three different task items here:
1. Arrange the OIDs in a consistent order everywhere.
2. Prepare the Partition Dispatch Info data structure in the planner
rather than during execution.
3. For update tuple routing, assume that the result rels are ordered
consistently, to make the searching efficient.

#3 depends on #1. So for that, I have come up with a minimal set of
changes to have expand_inherited_rtentry() generate the rels in bound
order. When we do #2, we may need to redo my changes in
expand_inherited_rtentry(), but those are minimal. We may even end up
using the walker function in multiple places, but right now that is
not certain.
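
For reference, a rough sketch of the kind of walker being discussed
(again mine, not from either posted patch, borrowing the
get_all_partition_oids name from above). It opens only the partitioned
tables, whose PartitionDesc already stores child OIDs in bound order,
and returns the whole tree breadth-first; leaf locking is deliberately
left out, since whether and when to lock leaves is the open question:

#include "postgres.h"
#include "access/heapam.h"
#include "catalog/partition.h"
#include "catalog/pg_class.h"
#include "nodes/pg_list.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"

/*
 * Sketch only: walk the partition tree breadth-first, emitting OIDs in
 * partition bound order at each level.  Only partitioned tables are
 * opened (and hence locked); leaf partitions are merely listed.
 */
static List *
get_all_partition_oids(Relation parent, LOCKMODE lockmode)
{
    List       *result = NIL;
    List       *pending = list_make1_oid(RelationGetRelid(parent));

    while (pending != NIL)
    {
        Oid         relid = linitial_oid(pending);
        Relation    rel;
        PartitionDesc pdesc;
        int         i;

        pending = list_delete_first(pending);
        result = lappend_oid(result, relid);

        /* Leaf partitions are not expanded (or opened) here */
        if (get_rel_relkind(relid) != RELKIND_PARTITIONED_TABLE)
            continue;

        rel = (relid == RelationGetRelid(parent)) ?
            parent : heap_open(relid, lockmode);

        /* PartitionDesc already stores child OIDs in bound order */
        pdesc = RelationGetPartitionDesc(rel);
        for (i = 0; i < pdesc->nparts; i++)
            pending = lappend_oid(pending, pdesc->oids[i]);

        if (rel != parent)
            heap_close(rel, NoLock);    /* keep the lock */
    }

    return result;
}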

So, I think we can continue the discussion about #1 and #2 in a separate thread.

>
>> Regarding dynamically locking specific partitions as and when needed,
>> I think this method inherently has the issue of deadlock because the
>> order would be random. So it feels like there is no way around it other
>> than to lock all partitions beforehand.
>
> I'm not sure why the order has to be random. If and when we decide to
> open and lock a subset of partitions for a given query, it will be done in
> some canonical order as far as I can imagine. Do you have some specific
> example in mind?

Suppose partitioned table t1 has partitions t1p1 and t1p2, and
partitioned table t2, at the same level, has partitions t2p1 and t2p2.
Tuple routing sends the first row to t2p2, so t2p2 is locked first;
the next row is routed to t1p1, so t1p1 is locked next. Meanwhile,
somebody runs DDL on some parent common to t1 and t2, which locks the
leaf partitions in a fixed order. That order can differ from the
insert's lock order, because the insert's order depends on which
tables the rows happened to be routed to; if the locks conflict,
the two backends can deadlock.
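
Just to make the "lock all partitions beforehand, in one canonical
order" idea concrete, a minimal sketch (mine, not something from the
patches; here the canonical order is ascending OID, i.e. the
find_all_inheritors() order):

#include "postgres.h"
#include "storage/lmgr.h"
#include "storage/lockdefs.h"

/* qsort() comparator for OIDs, ascending */
static int
oid_cmp_asc(const void *a, const void *b)
{
    Oid     oa = *(const Oid *) a;
    Oid     ob = *(const Oid *) b;

    if (oa < ob)
        return -1;
    if (oa > ob)
        return 1;
    return 0;
}

static void
lock_partitions_canonically(Oid *leaf_oids, int nleaves, LOCKMODE lockmode)
{
    int     i;

    /* Sort, so that every backend takes these locks in the same order */
    qsort(leaf_oids, nleaves, sizeof(Oid), oid_cmp_asc);

    for (i = 0; i < nleaves; i++)
        LockRelationOid(leaf_oids[i], lockmode);
}

As long as every backend sorts before locking, two backends can never
take conflicting locks on the same pair of partitions in opposite
orders, which is exactly the deadlock condition above.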


--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
