Re: UPDATE of partition key

From: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
To: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: UPDATE of partition key
Date: 2017-07-05 09:42:49
Message-ID: CAJ3gD9fdjk2O8aPMXidCeYeB-mFB=wY9ZLfe8cQOfG4bTqVGyQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 4 July 2017 at 15:23, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote:
> On 4 July 2017 at 14:48, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote:
>> On 4 July 2017 at 14:38, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>>> On 2017/07/04 17:25, Etsuro Fujita wrote:
>>>> On 2017/07/03 18:54, Amit Langote wrote:
>>>>> On 2017/07/02 20:10, Robert Haas wrote:
>>>>>> That
>>>>>> seems pretty easy to do - just have expand_inherited_rtentry() notice
>>>>>> that it's got a partitioned table and call
>>>>>> RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to
>>>>>> produce the list of OIDs.
>>>> Seems like a good idea.
>>>>
>>>>> Interesting idea.
>>>>>
>>>>> If we are going to do this, I think we may need to modify
>>>>> RelationGetPartitionDispatchInfo() a bit or invent an alternative that
>>>>> does not do as much work. Currently, it assumes that it's only ever
>>>>> called by ExecSetupPartitionTupleRouting() and hence also generates
>>>>> PartitionDispatchInfo objects for partitioned child tables. We don't need
>>>>> that if called from within the planner.
>>>>>
>>>>> Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled
>>>>> with its usage within the executor, because there is this comment:
>>>>>
>>>>> /*
>>>>> * We keep the partitioned ones open until we're done using the
>>>>> * information being collected here (for example, see
>>>>> * ExecEndModifyTable).
>>>>> */
>>>>
>>>> Yeah, we need some refactoring work. Is anyone working on that?
>>>
>>> I would like to take a shot at that if someone else hasn't already cooked
>>> up a patch. Working on making RelationGetPartitionDispatchInfo() a
>>> routine callable from both within the planner and the executor should be a
>>> worthwhile effort.
>>
>> What I am currently working on is to see if we can call
>> find_all_inheritors() or find_inheritance_children() instead of
>> generating the leaf_part_oids using APPEND_REL_PARTITION_OIDS().
>> Possibly we don't have to refactor it completely.
>> find_inheritance_children() needs to return the oids in canonical
>> order. So in find_inheritance_children () need to re-use part of
>> RelationBuildPartitionDesc() where it generates those oids in that
>> order. I am checking this part, and am going to come up with an
>> approach based on findings.
>
> The other approach is to make canonical ordering only in
> find_all_inheritors() by replacing call to find_inheritance_children()
> with the refactored/modified RelationGetPartitionDispatchInfo(). But
> that would mean that the callers of find_inheritance_children() would
> have one ordering, while the callers of find_all_inheritors() would
> have a different ordering; that brings up chances of deadlocks. That's
> why I think, we need to think about modifying the common function
> find_inheritance_children(), so that we would be consistent with the
> ordering. And then use find_inheritance_children() or
> find_all_inheritors() in RelationGetPartitionDispatchInfo(). So yes,
> there would be some modifications to
> RelationGetPartitionDispatchInfo().
>
>>
>> Also, need to investigate whether *always* sorting the oids in
>> canonical order is going to be much expensive than the current sorting
>> using oids. But I guess it won't be that expensive.

Like I mentioned upthread... in expand_inherited_rtentry(), if we
replace find_all_inheritors() with something else that returns oids in
canonical order, that will change the order in which children tables
get locked, which increases the chance of deadlock. Because, then the
callers of find_all_inheritors() will lock them in one order, while
callers of expand_inherited_rtentry() will lock them in a different
order. Even in the current code, I think there is a chance of
deadlocks because RelationGetPartitionDispatchInfo() and
find_all_inheritors() have different lock ordering.

Now, to get the oids of a partitioned table children sorted by
canonical ordering, (i.e. using the partition bound values) we need to
either use the partition bounds to sort the oids like the way it is
done in RelationBuildPartitionDesc() or, open the parent table and get
it's Relation->rd_partdesc->oids[] which are already sorted in
canonical order. So if we generate oids using this way in
find_all_inheritors() and find_inheritance_children(), that will
generate consistent ordering everywhere. But this method is quite
expensive as compared to the way oids are generated and sorted using
oid values in find_inheritance_children().

In both expand_inherited_rtentry() and
RelationGetPartitionDispatchInfo(), each of the child tables are
opened.

So, in both of these functions, what we can do is : call a new
function partition_tree_walker() which does following :
1. Lock the children using the existing order (i.e. sorted by oid
values) using the same function find_all_inheritors(). Rename
find_all_inheritors() to lock_all_inheritors(... , bool return_oids)
which returns the oid list only if requested.
2. And then scan through each of the partitions in canonical order, by
opening the parent table, then opening the partition descriptor oids,
and then doing whatever needs to be done with that partition rel.

partition_tree_walker() will look something like this :

void partition_tree_walker(Oid parentOid, LOCKMODE lockmode,
void (*walker_func) (), void *context)
{
Relation parentrel;
List *rels_list;
ListCell *cell;

(void) lock_all_inheritors(parentOid, lockmode,
false /* don't generate oids */);

parentrel = heap_open(parentOid, NoLock);
rels_list = append_rel_partition_oids(NIL, parentrel);

/* Scan through all partitioned rels, and at the
* same time append their children. */
foreach(cell, rels_list)
{
/* Open partrel without locking; lock_all_inheritors() has locked it */
Relation partrel = heap_open(lfirst_oid(cell), NoLock);

/* Append the children of a partitioned rel to the same list
* that we are iterating on */
if (RelationGetPartitionDesc(partrel))
rels_list = append_rel_partition_oids(rels_list, partrel);

/*
* Do whatever processing needs to be done on this partel.
* The walker function is free to either close the partel
* or keep it opened, but it needs to make sure the opened
* ones are closed later
*/
walker_func(partrel, context);
}
}

List *append_rel_partition_oids(List *rel_list, Relation rel)
{
int i;
for (i = 0; i < rel->rd_partdesc->nparts; i++)
rel_list = lappend_oid(rel_list, rel->rd_partdesc->oids[i]);

return rel_list;
}

So, in expand_inherited_rtentry() the foreach(l, inhOIDs) loop will be
replaced by partition_tree_walker(parentOid, expand_rte_walker_func)
where expand_rte_walker_func() will do all the work done in the for
loop for each of the partition rels.

Similarly, in RelationGetPartitionDispatchInfo() the initial part
where it uses APPEND_REL_PARTITION_OIDS() can be replaced by
partition_tree_walker(rel, dispatch_info_walkerfunc) where
dispatch_info_walkerfunc() will generate the oids, or may be populate
the complete PartitionDispatchData structure. 'pd' variable can be
passed as context to the partition_tree_walker(..., context)

Generating the resultrels in canonical order by opening the tables
using the above way wouldn't be more expensive than the existing code,
because even currently we anyways have to open all the tables in both
of these functions.

-Amit Khandekar

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2017-07-05 09:43:13 Re: Multi column range partition table
Previous Message AP 2017-07-05 09:31:40 Re: pgsql 10: hash indexes testing