Re: UPDATE of partition key

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: UPDATE of partition key
Date: 2017-08-04 07:41:02
Message-ID: a2775225-9f91-7e12-7d45-41e7abe0694e@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2017/08/02 19:49, Amit Khandekar wrote:
> On 2 August 2017 at 14:38, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>>> One approach I had considered was to have find_inheritance_children()
>>> itself lock the children in bound order, so that everyone will have
>>> bound-ordered oids, but that would be too expensive since it requires
>>> opening all partitioned tables to initialize partition descriptors. In
>>> find_inheritance_children(), we get all oids without opening any
>>> tables. But now that I think more of it, it's only the partitioned
>>> tables that we have to open, not the leaf partitions; and furthermore,
>>> I didn't see calls to find_inheritance_children() and
>>> find_all_inheritors() in performance-critical code, except in
>>> expand_inherited_rtentry(). All of them are in DDL commands; but yes,
>>> that can change in the future.
>>
>> This approach more or less amounts to calling the new
>> RelationGetPartitionDispatchInfo() (per my proposed patch, a version of
>> which I posted upthread.) Maybe we can add a wrapper on top, say,
>> get_all_partition_oids() which throws away other things that
>> RelationGetPartitionDispatchInfo() returned. In addition it locks all the
>> partitions that are returned, unlike only the partitioned ones, which is
>> what RelationGetPartitionDispatchInfo() has been taught to do.
>
> So there are three different task items here :
> 1. Arrange the oids in consistent order everywhere.
> 2. Prepare the Partition Dispatch Info data structure in the planner
> as against during execution.
> 3. For update tuple routing, assume that the result rels are ordered
> consistently to make the searching efficient.

That's a good breakdown.

> #3 depends on #1. So for that, I have come up with a minimum set of
> changes to have expand_inherited_rtentry() generate the rels in bound
> order. When we do #2 , it may be possible that we may need to re-do my
> changes in expand_inherited_rtentry(), but those are minimum. We may
> even end up having the walker function being used at multiple places,
> but right now it is not certain.

So AFAICS:

For performance reasons, we want the order in which leaf partition
sub-plans appear in the ModifyTable node (and subsequently leaf partition
ResultRelInfos ModifyTableState) to be some known canonical order. That's
because we want to map partitions in the insert tuple-routing data
structure (which appear in a known canonical order as determined by
RelationGetPartititionDispatchInfo) to those appearing in the
ModifyTableState. That's so that we can reuse the planner-generated WCO
and RETURNING lists in the insert code path when update tuple-routing
invokes that path.

To implement that, planner should retrieve the list of leaf partition OIDs
in the same order as ExecSetupPartitionTupleRouting() retrieves them.
Because the latter calls RelationGetPartitionDispatchInfo on the root
partitioned table, maybe the planner should do that too, instead of its
current method getting OIDs using find_all_inheritors(). But it's
currently not possible due to the way RelationGetPartitionDispatchInfo()
and involved data structures are designed.

One way forward I see is to invent new interface functions:

List *get_all_partition_oids(Oid, LOCKMODE)
List *get_partition_oids(Oid, LOCKMODE)

that resemble find_all_inheritors() and find_inheritance_children(),
respectively, but expects that users make sure that they are called only
for partitioned tables. Needless to mention, OIDs are returned with
canonical order determined by that of the partition bounds and partition
tree structure. We replace all the calls of the old interface functions
with the respective new ones. That means expand_inherited_rtentry (among
others) now calls get_all_partition_oids() if the RTE is for a partitioned
table and find_all_inheritors() otherwise.

> So, I think we can continue the discussion about #1 and #2 in a separate thread.

I have started a new thread named "expanding inheritance in partition
bound order" and posted a couple of patches [1].

After applying those patches, you can write code for #3 without having to
worry about the concerns of partition order, which I guess you've already
done.

>>> Regarding dynamically locking specific partitions as and when needed,
>>> I think this method inherently has the issue of deadlock because the
>>> order would be random. So it feels like there is no way around other
>>> than to lock all partitions beforehand.
>>
>> I'm not sure why the order has to be random. If and when we decide to
>> open and lock a subset of partitions for a given query, it will be done in
>> some canonical order as far as I can imagine. Do you have some specific
>> example in mind?
>
> Partitioned table t1 has partitions t1p1 and t1p2
> Partitioned table t2 at the same level has partitions t2p1 and t2p2
> Tuple routing causes the first row to insert into t2p2, so t2p2 is locked.
> Next insert locks t1p1 because it inserts into t1p1.
> But at the same time, somebody does DDL on some parent common to t1
> and t2, so it locks the leaf partitions in a fixed specific order,
> which would be different than the insert lock order because that order
> depended upon the order of tables that the insert rows were routed to.

Note that we don't currently do this. That is, lock partitions in an
order determined by incoming rows. ExecSetupPartitionTupleRouting() locks
(RowExclusiveLock) all the partitions beforehand in the partition bound
order. Any future patch that wants to delay locking and opening the
relation descriptor of a leaf partition to when a tuple is actually routed
to it will have to think hard about the deadlock problem you illustrate above.

Aside from the insert case, let's consider locking order when planning a
select on a partitioned table. We currently lock all the partitions in
advance in expand_inherited_rtentry(). When replacing the current method
by some new way, we will first determine all the partitions that satisfy a
given query, collect them in an ordered list (some fixed canonical order),
and lock them in that order.

But maybe, I misunderstood what you said?

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/0118a1f2-84bb-19a7-b906-dec040a206f2%40lab.ntt.co.jp

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2017-08-04 08:27:35 Small code improvement for btree
Previous Message Amit Langote 2017-08-04 07:38:46 expanding inheritance in partition bound order