Re: UPDATE of partition key

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: UPDATE of partition key
Date: 2017-11-02 07:10:44
Message-ID: c5e1d4ad-d243-52c5-608b-5dbb7183e465@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Amit.

Thanks a lot for updated patches and sorry that I couldn't get to looking
at your emails sooner. Note that I'm replying here to both of your
emails, but looking at only the latest v22 patch.

On 2017/10/24 0:15, Amit Khandekar wrote:
> On 16 October 2017 at 08:28, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>>
>> + (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup
>> == NULL))))
>>
>> Is there some reason why a bitwise operator is used here?
>
> That exact condition means that the function is called for transition
> capture for updated rows being moved to another partition. For this
> scenario, either the oldtup or the newtup is NULL. I wanted to exactly
> capture that condition there. I think the bitwise operator is more
> user-friendly in emphasizing the point that it is indeed an "either a
> or b, not both" condition.

I see. In that case, since this patch adds the new condition, a note
about it in the comment just above would be good, because the situation
you describe here seems to arise only during update-tuple-routing, IIUC.

>> + * 'perleaf_parentchild_maps' receives an array of TupleConversionMap objects
>> + * with on entry for every leaf partition (required to convert input
>> tuple
>> + * based on the root table's rowtype to a leaf partition's rowtype after
>> + * tuple routing is done)
>>
>> Could this be named leaf_tupconv_maps, maybe? It perhaps makes clear that
>> they are maps needed for "tuple conversion". And the other field holding
>> the reverse map as leaf_rev_tupconv_maps. Either that or use underscores
>> to separate words, but then it gets too long I guess.
>
> In master branch, now this param is already there with the name
> "tup_conv_maps". In the rebased version in the earlier mail, I haven't
> again changed it. I think "tup_conv_maps" looks clear enough.

OK.

In the latest patch:

+ * 'update_rri' has the UPDATE per-subplan result rels. These are re-used
+ * instead of allocating new ones while generating the array of all leaf
+ * partition result rels.

Instead of:

"These are re-used instead of allocating new ones while generating the
array of all leaf partition result rels."

how about:

"There is no need to allocate a new ResultRellInfo entry for leaf
partitions for which one already exists in this array"

>> ISTM, ConvertPartitionTupleSlot has a very specific job and a bit complex
>> interface. I guess it could simply have the following interface:
>>
>> static HeapTuple ConvertPartitionTuple(ModifyTabelState *mtstate,
>> HeapTuple tuple, bool is_update);
>>
>> And figure out, based on the value of is_update, which map to use and
>> which slot to set *p_new_slot to (what is now "new_slot" argument).
>> You're getting mtstate here anyway, which contains all the information you
>> need here. It seems better to make that (selecting which map and which
>> slot) part of the function's implementation if we're having this function
>> at all, imho. Maybe I'm missing some details there, but my point still
>> remains that we should try to put more logic in that function instead of
>> it just do the mechanical tuple conversion.
>
> I tried to see how the interface would look if we do that way. Here is
> how the code looks :
>
> static TupleTableSlot *
> ConvertPartitionTupleSlot(ModifyTableState *mtstate,
> bool for_update_tuple_routing,
> int map_index,
> HeapTuple *tuple,
> TupleTableSlot *slot)
> {
> TupleConversionMap *map;
> TupleTableSlot *new_slot;
>
> if (for_update_tuple_routing)
> {
> map = mtstate->mt_persubplan_childparent_maps[map_index];
> new_slot = mtstate->mt_rootpartition_tuple_slot;
> }
> else
> {
> map = mtstate->mt_perleaf_parentchild_maps[map_index];
> new_slot = mtstate->mt_partition_tuple_slot;
> }
>
> if (!map)
> return slot;
>
> *tuple = do_convert_tuple(*tuple, map);
>
> /*
> * Change the partition tuple slot descriptor, as per converted tuple.
> */
> ExecSetSlotDescriptor(new_slot, map->outdesc);
> ExecStoreTuple(*tuple, new_slot, InvalidBuffer, true);
>
> return new_slot;
> }
>
> It looks like the interface does not much simplify, and above that, we
> have more number of lines in that function. Also, the caller anyway
> has to be aware whether the map_index is the index into the leaf
> partitions or the update subplans. So it is not like the caller does
> not have to be aware about whether the mapping should be
> mt_persubplan_childparent_maps or mt_perleaf_parentchild_maps.

Hmm, I think we should try to make it so that the caller doesn't have to
be aware of that. And by caller I guess you mean ExecInsert(), which
should not be a place, IMHO, where to try to introduce a lot of new logic
specific to update tuple routing. ISTM, ModifyTableState now has one too
many TupleConversionMap pointer arrays after the patch, creating the need
to choose from in the first place. AIUI -

* mt_perleaf_parentchild_maps:

- each entry is a map to convert root parent's tuples to a given leaf
partition's format

- used to be called mt_partition_tupconv_maps and is needed when tuple-
routing is in use; for both INSERT and UPDATE with tuple-routing

- as many entries in the array as there are leaf partitions and stored
in the partition bound order

* mt_perleaf_childparent_maps:

- each entry is a map to convert a leaf partition's tuples to the root
parent's format

- newly added by this patch and seems to be needed for UPDATE with
tuple-routing for two needs: 1. tuple-routing should start with a
tuple in root parent format whereas the tuple received is in leaf
partition format when ExecInsert() called for update-tuple-routing (by
ExecUpdate), 2. after tuple-routing, we must capture the tuple
inserted into the partition in the transition tuplestore which accepts
tuples in root parent's format

- as many entries in the array as there are leaf partitions and stored
in the partition bound order

* mt_persubplan_childparent_maps:

- each entry is a map to convert a child table's tuples to the root
parent's format

- used to be called mt_transition_tupconv_maps and needed for converting
child tuples to the root parent's format when storing them in the
transition tuplestore which accepts tuples in root parent's format

- as many entries in the array as there are sub-plans in mt_plans and
stored in either the partition bound order or unknown order (the
latter in the regular inheritance case)

I think we could combine the last two into one. The only apparent reason
for them to be separate seems to be that the subplan array might contain
less entries than perleaf array and ExecInsert() has only enough
information to calculate the offset of a map in the persubplan array.
That is, resultRelInfo of leaf partition that ExecInsert starts with in
the update-tuple-routing case comes from mtstate->resultRelInfo array
which contains only mt_nplans entries. So, if we only have the array with
entries for *all* partitions, it's hard to get the offset of the map to
use in that array.

I suggest we don't add a new map array and a significant amount of new
code to initialize the same and to implement the logic to choose the
correct array to get the map from. Instead, we could simply add an array
of integers with mt_nplans entries. Each entry is an offset of a given
sub-plan in the array containing entries of something for *all*
partitions. Since, we are teaching ExecSetupPartitionTupleRouting() to
reuse ResultRelInfos from mtstate->resultRelInfos, there is a suitable
place to construct such array. Let's say the array is called
mt_subplan_partition_offsets[]. Let ExecSetupPartitionTupleRouting() also
initialize the parent-to-partition maps for *all* partitions, in the
update-tuple-routing case. Then add a quick-return check in
ExecSetupTransitionCaptureState() to see if the map has already been set
by ExecSetupPartitionTupleRouting(). Since we're using the same map for
two purposes, we could rename mt_transition_tupconv_maps to something that
doesn't bind it to its use only for transition tuple capture.

With that, now there are no persubplan and perleaf arrays for ExecInsert()
to pick from to select a map to pass to ConvertPartitionTupleSlot(), or
maybe even no need for the separate function. The tuple-routing code
block in ExecInsert would look like below (writing resultRelInfo as just Rel):

rootRel = (mtstate->rootRel != NULL) ? mtstate->rootRel : Rel

if (rootRel != Rel) /* update tuple-routing active */
{
int subplan_off = Rel - mtstate->Rel[0];
int leaf_off = mtstate->mt_subplan_partition_offsets[subplan_off];

if (mt_transition_tupconv_maps[leaf_off])
{
/*
* Convert to root format using
* mt_transition_tupconv_maps[leaf_off]
*/

slot = mt_root_tuple_slot; /* for tuple-routing */

/* Store the converted tuple into slot */
}
}

/* Existing tuple-routing flow follows */
new_leaf = ExecFindPartition(rootRel, slot, ...)

if (mtstate->transition_capture)
{
transition_capture_map = mt_transition_tupconv_maps[new_leaf]
}

if (mt_partition_tupconv_maps[new_leaf])
{
/*
* Convert to leaf format using mt_partition_tupconv_maps[new_leaf]
*/

slot = mt_partition_tuple_slot;

/* Store the converted tuple into slot */
}

>> ISTM, the newly introduced logic in ExecSetupTransitionCaptureState() to
>> try to reuse the per-subplan child-to-parent map as per-leaf
>> child-to-parent map could be simplified a bit. I mean the following code:
>>
>> + /*
>> + * But for Updates, we can share the per-subplan maps with the per-leaf
>> + * maps.
>> + */
>> + update_rri_index = 0;
>> + update_rri = mtstate->resultRelInfo;
>> + if (mtstate->mt_nplans > 0)
>> + cur_reloid = RelationGetRelid(update_rri[0].ri_RelationDesc);
>>
>> - /* Choose the right set of partitions */
>> - if (mtstate->mt_partition_dispatch_info != NULL)
>> + for (i = 0; i < numResultRelInfos; ++i)
>> + {
>> <snip>
>>
>> How about (pseudo-code):
>>
>> j = 0;
>> for (i = 0; i < n_leaf_parts; i++)
>> {
>> if (j < n_subplans && leaf_rri[i]->oid == subplan_rri[j]->oid)
>> {
>> leaf_childparent_map[i] = subplan_childparent_map[j];
>> j++;
>> }
>> else
>> {
>> leaf_childparent_map[i] = new map
>> }
>> }
>>
>> I think the above would also be useful in ExecSetupPartitionTupleRouting()
>> where you've added similar code to try to reuse per-subplan ResultRelInfos.
>
> Did something like that in the attached patch. Please have a look.
> After we conclude on that, will do the same for
> ExecSetupPartitionTupleRouting() as well.

Yeah, ExecSetupTransitionCaptureState() looks better in v22, but as I
explained above, we may not need to change the function so much. The
approach, OTOH, should be adopted for ExecSetupPartitionTupleRouting().

>> In the following:
>>
>> ExecSetupPartitionTupleRouting(rel,
>> + (operation == CMD_UPDATE ?
>> + mtstate->resultRelInfo : NULL),
>> + (operation == CMD_UPDATE ? nplans
>> : 0),
>>
>> Can the second parameter be made to not span two lines? It was a bit hard
>> for me to see that there two new parameters.
>
> I think it is safe to just pass mtstate->resultRelInfo. Inside
> ExecSetupPartitionTupleRouting() we should anyways check only the
> nplans param (and not update_rri) to decide whether it is for insert
> or update. So did the same.

OK.

>> By the way, I've seen in a number of places that the patch calls "root
>> table" a partition. Not just in comments, but also a variable appears to
>> be given a name which contains rootpartition. I can see only one instance
>> where root is called a partition in the existing source code, but it seems
>> to have been introduced only recently:
>>
>> allpaths.c:1333: * A root partition will already have a
>
> Changed to either this :
> root partition => root partitioned table
> or this if we have to refer to it too often :
> root partition => root

That seems fine, thanks.

On 2017/10/25 15:10, Amit Khandekar wrote:
> On 16 October 2017 at 08:28, Amit Langote wrote:
>> In ExecInitModifyTable(), can we try to minimize the number of places
>> where update_tuple_routing_needed is being set. Currently, it's being set
>> in 3 places:
>
> I think the way it's done seems ok. For each resultRelInfo,
> update_tuple_routing_needed is updated in case that resultRel has
> partition cols changed. And at that point, we don't have rel opened,
> so we can't check if that rel is partitioned. So another check is
> required outside of the loop.

I understood why now.

>> + * qual for each partition. Note that, if there are SubPlans in
>> there,
>> + * they all end up attached to the one parent Plan node.
>>
>> The sentence starting with "Note that, " is a bit unclear.
>>
>> + Assert(update_tuple_routing_needed ||
>> + (operation == CMD_INSERT &&
>> + list_length(node->withCheckOptionLists) == 1 &&
>> + mtstate->mt_nplans == 1));
>>
>> The comment I complained about above is perhaps about this Assert.
>
> That is an existing comment.

Sorry, my bad.

> On HEAD, the "parent Plan" refers to
> mtstate->mt_plans[0]. Now in the patch, for the parent node in
> ExecInitQual(), mtstate->ps is passed rather than mt_plans[0]. So the
> parent plan refers to this mtstate node.

Hmm, I'm not really sure if doing that (passing mtstate->ps) would be
accurate. In the update tuple routing case, it seems that it's better to
pass the correct parent PlanState pointer to ExecInitQual(), that is, one
corresponding to the partition's sub-plan. At least I get that feeling by
looking at how parent is used downstream to that ExecInitQual() call, but
there *may* not be anything to worry about there after all. I'm unsure.

> BTW, the reason I had changed the parent node to mtstate->ps is :
> Other places in that code use mtstate->ps while initializing
> expressions :
>
> /*
> * Build a projection for each result rel.
> */
> resultRelInfo->ri_projectReturning =
> ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
> resultRelInfo->ri_RelationDesc->rd_att);
>
> ...........
>
> /* build DO UPDATE WHERE clause expression */
> if (node->onConflictWhere)
> {
> ExprState *qualexpr;
>
> qualexpr = ExecInitQual((List *) node->onConflictWhere,
> &mtstate->ps);
> ....
> }
>
> I think wherever we initialize expressions belonging to a plan, we
> should use that plan as the parent. WithCheckOptions are fields of
> ModifyTableState.

You may be right, but I see for WithCheckOptions initialization
specifically that the non-tuple-routing code passes the actual sub-plan
when initializing the WCO for a given result rel.

>> Comments on the optimizer changes:
>>
>> +get_all_partition_cols(List *rtables,
>>
>> Did you mean rtable?
>
> I did mean rtables. It's a list of rtables.

It's not, AFAIK. rtable (range table) is a list of range table entries,
which is also what seems to get passed to get_all_partition_cols for that
argument (root->parse->rtable, which is not a list of lists).

Moreover, there are no existing instances of this naming within the
planner other than those that this patch introduces:

$ grep rtables src/backend/optimizer/
planner.c:114: static void get_all_partition_cols(List *rtables,
planner.c:1063: get_all_partition_cols(List *rtables,
planner.c:1069: Oid root_relid = getrelid(root_rti, rtables);
planner.c:1078: Oid relid = getrelid(rti, rtables);

OTOH, dependency.c does have rtables, but it's actually a list of range
tables. For example:

dependency.c:1360: context.rtables = list_make1(rtable);

>> + if (partattno != 0)
>> + child_keycols =
>> + bms_add_member(child_keycols,
>> + partattno -
>> FirstLowInvalidHeapAttributeNumber);
>> + }
>> + foreach(lc, partexprs)
>> + {
>>
>> Elsewhere (in quite a few places), we don't iterate over partexprs
>> separately like this, although I'm not saying it is bad, just different
>> from other places.
>
> I think you are suggesting we do it like how it's done in
> is_partition_attr(). Can you please let me know other places we do
> this same way ? I couldn't find.

OK, not as many as I thought there would be, but there are following
beside is_partition_attrs():

partition.c: get_range_nulltest()
partition.c: get_qual_for_range()
relcache.c: RelationBuildPartitionKey()

>> Aha, so here's where all_part_cols was being set before...
>
> Yes, and we used to have PartitionedChildRelInfo.all_part_cols field
> for that. We used to populate that while traversing through the
> partition tree in expand_inherited_rtentry(). I agreed with Dilip's
> opinion that this would unnecessarily add up some processing even when
> the query is not a DML. And also, we don't have to have
> PartitionedChildRelInfo.all_part_cols. For the earlier implementation,
> check v18 patch or earlier versions.

Hmm, I think I have to agree with both you and Dilip that that would add
some redundant processing to other paths.

> Attached v22 patch.

Thanks again.

Regards,
Amit

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2017-11-02 07:19:57 Re: SQL/JSON in PostgreSQL
Previous Message Tsunakawa, Takayuki 2017-11-02 06:16:44 Re: Statement-level rollback