Re: expanding inheritance in partition bound order

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: expanding inheritance in partition bound order
Date: 2017-08-07 06:54:44
Message-ID: 39f9a6e6-9da5-aafb-3d17-4b8f540d6d15@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2017/08/05 2:25, Robert Haas wrote:
> On Fri, Aug 4, 2017 at 3:38 AM, Amit Langote
> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> The current way to expand inherited tables, including partitioned tables,
>> is to use either find_all_inheritors() or find_inheritance_children()
>> depending on the context. They return child table OIDs in the (ascending)
>> order of those OIDs, which means the callers that need to lock the child
>> tables can do so without worrying about the possibility of deadlock in
>> some concurrent execution of that piece of code. That's good.
>>
>> For partitioned tables, there is a possibility of returning child table
>> (partition) OIDs in the partition bound order, which in addition to
>> preventing the possibility of deadlocks during concurrent locking, seems
>> potentially useful for other caller-specific optimizations. For example,
>> tuple-routing code can utilize that fact to implement binary-search based
>> partition-searching algorithm. For one more example, refer to the "UPDATE
>> partition key" thread where it's becoming clear that it would be nice if
>> the planner had put the partitions in bound order in the ModifyTable that
>> it creates for UPDATE of partitioned tables [1].
>
> I guess I don't really understand why we want to change the locking
> order. That is bound to make expanding the inheritance hierarchy more
> expensive. If we use this approach in all cases, it seems to me we're
> bound to reintroduce the problem we fixed in commit
> c1e0e7e1d790bf18c913e6a452dea811e858b554 and maybe add a few more in
> the same vein. But I don't see that there's any necessary relation
> between the order of locking and the order of expansion inside the
> relcache entry/plan/whatever else -- so I propose that we keep the
> existing locking order and only change the other stuff.

Hmm, okay.

I guess I was trying to fit one solution to what might be two (or worse,
more) problems of the current implementation, which is not good.

> While reading related code this morning, I noticed that
> RelationBuildPartitionDesc and RelationGetPartitionDispatchInfo have
> *already* changed the locking order for certain operations, because
> the PartitionDesc's OID array is bound-ordered not OID-ordered. That
> means that when RelationGetPartitionDispatchInfo uses the
> PartitionDesc's OID arra to figure out what to lock, it's potentially
> going to encounter partitions in a different order than would have
> been the case if it had used find_all_inheritors directly.

I think Amit Khandekar mentioned this on the UPDATE partition key thread [1].

> I'm
> tempted to think that RelationGetPartitionDispatchInfo shouldn't
> really be doing locking at all. The best way to have the locking
> always happen in the same order is to have only one piece of code that
> determines that order - and I vote for find_all_inheritors. Aside
> from the fact that it's the way we've always done it (and still do it
> in most other places), that code includes infinite-loop defenses which
> the new code you've introduced lacks.

As long as find_all_inheritors() is a place only to determine the order in
which partitions will be locked, it's fine. My concern is about the time
of actual locking, which in the current planner implementation is too soon
that we end up needlessly locking all the partitions.

(Also in the current implementation, we open all the partitions to
construct Var translation lists, which are actually unused through most of
the planner stages, but admittedly it's a separate issue.)

The locking-partitions-too-soon issue, I think, is an important one and
may need discussing separately, but thought I'd mention it anyway. It
also seems somewhat related to this discussion, but I may be wrong.

> Concretely, my proposal is:
>
> 1. Before calling RelationGetPartitionDispatchInfo, the calling code
> should use find_all_inheritors to lock all the relevant relations (or
> the planner could use find_all_inheritors to get a list of relation
> OIDs, store it in the plan in order, and then at execution time we
> visit them in that order and lock them).

ISTM, we'd want to lock the partitions after we've determined the specific
ones a query needs to scan using the information returned by
RelationGetPartitionDispatchInfo. That means the latter had better locked
the relations whose cached partition descriptors will be used to determine
the result that it produces. One way to do that might be to lock all the
tables in the list returned by find_all_inheritors that are partitioned
tables before calling RelationGetPartitionDispatchInfo. It seems what the
approach you've outlined below will let us do that.

BTW, IIUC, there will be two lists of OIDs we'll have: one in the
find_all_inheritors order, say, L1 and the other determined by using
partitioning-specific information for the given query, say L2.

To lock, we iterate L1 and if a given member is in L2, we lock it. It
might be possible to make it as cheap as O(nlogn).

L2 is the order we put leaf partitions into a given plan.

> 2. RelationGetPartitionDispatchInfo assumes the relations are already locked.
>
> 3. While we're optimizing, in the first loop inside of
> RelationGetPartitionDispatchInfo, don't call heap_open(). Instead,
> use get_rel_relkind() to see whether we've got a partitioned table; if
> so, open it. If not, there's no need.

That's what the proposed refactoring patch 0002 actually does.

> 4. For safety, add a function bool RelationLockHeldByMe(Oid) and add
> to this loop a check if (!RelationLockHeldByMe(lfirst_oid(lc1))
> elog(ERROR, ...). Might be interesting to stuff that check into the
> relation_open(..., NoLock) path, too.
>
> One objection to this line of attack is that there might be a good
> case for locking only the partitioned inheritors first and then going
> back and locking the leaf nodes in a second pass, or even only when
> required for a particular row. However, that doesn't require putting
> everything in bound order - it only requires moving the partitioned
> children to the beginning of the list. And I think rather than having
> new logic for that, we should teach find_inheritance_children() to do
> that directly. I have a feeling Ashutosh is going to cringe at this
> suggestion, but my idea is to do this by denormalizing: add a column
> to pg_inherits indicating whether the child is of
> RELKIND_PARTITIONED_TABLE. Then, when find_inheritance_children scans
> pg_inherits, it can pull that flag out for free along with the
> relation OID, and qsort() first by the flag and then by the OID. It
> can also return the number of initial elements of its return value
> which have that flag set.

Maybe, we can make the initial patch use syscache to get the relkind for a
given child. If the syscache bloat is unbearable, we go with the
denormalization approach.

> Then, in find_all_inheritors, we split rels_list into
> partitioned_rels_list and other_rels_list, and process
> partitioned_rels_list in its entirety before touching other_rels_list;
> they get concatenated at the end.
>
> Now, find_all_inheritors and find_inheritance_children can also grow a
> flag bool only_partitioned_children; if set, then we skip the
> unpartitioned children entirely.
>
> With all that in place, you can call find_all_inheritors(blah blah,
> false) to lock the whole hierarchy, or find_all_inheritors(blah blah,
> true) to lock just the partitioned tables in the hierarchy. You get a
> consistent lock order either way, and if you start with only the
> partitioned tables and later want the leaf partitions too, you just go
> through the partitioned children in the order they were returned and
> find_inheritance_children(blah blah, false) on each one of them and
> the lock order is exactly consistent with what you would have gotten
> if you'd done find_all_inheritors(blah blah, false) originally.
>
> Thoughts?

So, with this in place:

1. Call find_all_inheritors to lock partitioned tables in the tree in an
order prescribed by OIDs

2. Call RelationGetPartitionDispatchInfo at an opportune time, which will
generate minimal information about the partition tree that it can do
without having to worry about locking anything

3. Determine the list of which leaf partitions will need to be scanned
using the information obtained in 2, if possible to do that at all [2].

4. Lock leaf partitions in the find_inheritance_children prescribed order,
but only those that are in the list built in 3.

> P.S. While I haven't reviewed 0002 in detail, I think the concept of
> minimizing what needs to be built in RelationGetPartitionDispatchInfo
> is a very good idea.

Thanks.

Regards,
Amit

[1]
https://www.postgresql.org/message-id/CAJ3gD9fdjk2O8aPMXidCeYeB-mFB%3DwY9ZLfe8cQOfG4bTqVGyQ%40mail.gmail.com

[2] We can do that in set_append_rel_size(), but not in
inheritance_planner()

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2017-08-07 11:51:18 Re: Adding hook in BufferSync for backup purposes
Previous Message Etsuro Fujita 2017-08-07 06:45:10 Re: Tuple-routing for certain partitioned tables not working as expected