Quick Links

Re: UPDATE of partition key

From:	Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
To:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: UPDATE of partition key
Date:	2017-10-23 15:15:25
Message-ID:	CAJ3gD9ewLB2i4ojRScTt=H_ruoY_eKLxkn+rsge=DfdHqxLs7Q@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 16 October 2017 at 08:28, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:

>
> + * the transition tuplestores can be built. Furthermore, if the transition
> + * capture is happening for UPDATEd rows being moved to another
> partition due
> + * partition-key change, then this function is called once when the row is
> + * deleted (to capture OLD row), and once when the row is inserted to
> another
> + * partition (to capture NEW row). This is done separately because
> DELETE and
> + * INSERT happen on different tables.
>
> Extra space at the beginning from the 2nd line onwards.

Just observed that the existing comment lines use tab instead of
spaces. I have now used tab for the new comments, instead of the
multiple spaces.

>
> + (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup
> == NULL))))
>
> Is there some reason why a bitwise operator is used here?

That exact condition means that the function is called for transition
capture for updated rows being moved to another partition. For this
scenario, either the oldtup or the newtup is NULL. I wanted to exactly
capture that condition there. I think the bitwise operator is more
user-friendly in emphasizing the point that it is indeed an "either a
or b, not both" condition.

>
> + * 'update_rri' has the UPDATE per-subplan result rels.
>
> Could you explain why they are being received as input here?

Added the explanation in the comments.

>
> + * 'perleaf_parentchild_maps' receives an array of TupleConversionMap objects
> + * with on entry for every leaf partition (required to convert input
> tuple
> + * based on the root table's rowtype to a leaf partition's rowtype after
> + * tuple routing is done)
>
> Could this be named leaf_tupconv_maps, maybe? It perhaps makes clear that
> they are maps needed for "tuple conversion". And the other field holding
> the reverse map as leaf_rev_tupconv_maps. Either that or use underscores
> to separate words, but then it gets too long I guess.

In master branch, now this param is already there with the name
"tup_conv_maps". In the rebased version in the earlier mail, I haven't
again changed it. I think "tup_conv_maps" looks clear enough.

>
>
> + tuple = ConvertPartitionTupleSlot(mtstate,
> +
> mtstate->mt_perleaf_parentchild_maps[leaf_part_index],
> +
>
> The 2nd line here seems to have gone over 80 characters.
>
> ISTM, ConvertPartitionTupleSlot has a very specific job and a bit complex
> interface. I guess it could simply have the following interface:
>
> static HeapTuple ConvertPartitionTuple(ModifyTabelState *mtstate,
> HeapTuple tuple, bool is_update);
>
> And figure out, based on the value of is_update, which map to use and
> which slot to set *p_new_slot to (what is now "new_slot" argument).
> You're getting mtstate here anyway, which contains all the information you
> need here. It seems better to make that (selecting which map and which
> slot) part of the function's implementation if we're having this function
> at all, imho. Maybe I'm missing some details there, but my point still
> remains that we should try to put more logic in that function instead of
> it just do the mechanical tuple conversion.

I tried to see how the interface would look if we do that way. Here is
how the code looks :

static TupleTableSlot *
ConvertPartitionTupleSlot(ModifyTableState *mtstate,
bool for_update_tuple_routing,
int map_index,
HeapTuple *tuple,
TupleTableSlot *slot)
{
TupleConversionMap *map;
TupleTableSlot *new_slot;

if (for_update_tuple_routing)
{
map = mtstate->mt_persubplan_childparent_maps[map_index];
new_slot = mtstate->mt_rootpartition_tuple_slot;
}
else
{
map = mtstate->mt_perleaf_parentchild_maps[map_index];
new_slot = mtstate->mt_partition_tuple_slot;
}

if (!map)
return slot;

*tuple = do_convert_tuple(*tuple, map);

/*
* Change the partition tuple slot descriptor, as per converted tuple.
*/
ExecSetSlotDescriptor(new_slot, map->outdesc);
ExecStoreTuple(*tuple, new_slot, InvalidBuffer, true);

return new_slot;
}

It looks like the interface does not much simplify, and above that, we
have more number of lines in that function. Also, the caller anyway
has to be aware whether the map_index is the index into the leaf
partitions or the update subplans. So it is not like the caller does
not have to be aware about whether the mapping should be
mt_persubplan_childparent_maps or mt_perleaf_parentchild_maps.

>
> + * We have already checked partition constraints above, so skip them
> + * below.
>
> How about: ", so skip checking here."?

Ok I have made it this way :
* We have already checked partition constraints above, so skip
* checking them here.

>
>
> ISTM, the newly introduced logic in ExecSetupTransitionCaptureState() to
> try to reuse the per-subplan child-to-parent map as per-leaf
> child-to-parent map could be simplified a bit. I mean the following code:
>
> + /*
> + * But for Updates, we can share the per-subplan maps with the per-leaf
> + * maps.
> + */
> + update_rri_index = 0;
> + update_rri = mtstate->resultRelInfo;
> + if (mtstate->mt_nplans > 0)
> + cur_reloid = RelationGetRelid(update_rri[0].ri_RelationDesc);
>
> - /* Choose the right set of partitions */
> - if (mtstate->mt_partition_dispatch_info != NULL)
> + for (i = 0; i < numResultRelInfos; ++i)
> + {
> <snip>
>
> How about (pseudo-code):
>
> j = 0;
> for (i = 0; i < n_leaf_parts; i++)
> {
> if (j < n_subplans && leaf_rri[i]->oid == subplan_rri[j]->oid)
> {
> leaf_childparent_map[i] = subplan_childparent_map[j];
> j++;
> }
> else
> {
> leaf_childparent_map[i] = new map
> }
> }
>
> I think the above would also be useful in ExecSetupPartitionTupleRouting()
> where you've added similar code to try to reuse per-subplan ResultRelInfos.

Did something like that in the attached patch. Please have a look.
After we conclude on that, will do the same for
ExecSetupPartitionTupleRouting() as well.

>
>
> In ExecInitModifyTable(), can we try to minimize the number of places
> where update_tuple_routing_needed is being set. Currently, it's being set
> in 3 places:

Will see if we can skip some checks (TODO).

> In the following:
>
> ExecSetupPartitionTupleRouting(rel,
> + (operation == CMD_UPDATE ?
> + mtstate->resultRelInfo : NULL),
> + (operation == CMD_UPDATE ? nplans
> : 0),
>
> Can the second parameter be made to not span two lines? It was a bit hard
> for me to see that there two new parameters.

I think it is safe to just pass mtstate->resultRelInfo. Inside
ExecSetupPartitionTupleRouting() we should anyways check only the
nplans param (and not update_rri) to decide whether it is for insert
or update. So did the same.

>
> + * Construct mapping from each of the resultRelInfo attnos to the root
>
> Maybe it's odd to say "resultRelInfo attno", because it's really the
> underlying partition whose attnos we're talking about as being possibly
> different from the root table's attnos.

Changed : resultRelInfo => partition

>
> + * descriptor. In such case we need to convert tuples to the root
>
> s/In such case/In such a case,/

Done.

>
> By the way, I've seen in a number of places that the patch calls "root
> table" a partition. Not just in comments, but also a variable appears to
> be given a name which contains rootpartition. I can see only one instance
> where root is called a partition in the existing source code, but it seems
> to have been introduced only recently:
>
> allpaths.c:1333: * A root partition will already have a

Changed to either this :
root partition => root partitioned table
or this if we have to refer to it too often :
root partition => root

>
> + * qual for each partition. Note that, if there are SubPlans in
> there,
> + * they all end up attached to the one parent Plan node.
>
> The sentence starting with "Note that, " is a bit unclear.
>
> + Assert(update_tuple_routing_needed ||
> + (operation == CMD_INSERT &&
> + list_length(node->withCheckOptionLists) == 1 &&
> + mtstate->mt_nplans == 1));
>
> The comment I complained about above is perhaps about this Assert.
>
> - List *mapped_wcoList;
> + List *mappedWco;
>
> Not sure why this rename. After this rename, it's now inconsistent with
> the code above which handles non-partitioned case, which still calls it
> wcoList. Maybe, because you introduced firstWco and then this line:
>
> + firstWco = linitial(node->withCheckOptionLists);
>
> but note that each member of node->withCheckOptionLists is also a list, so
> the original naming. Also, further below, you're assigning mappedWco to
> a List * field.
>
> + resultRelInfo->ri_WithCheckOptions = mappedWco;
>
>
> Comments on the optimizer changes:
>
> +get_all_partition_cols(List *rtables,
>
> Did you mean rtable?
>
>
> + get_all_partition_cols(root->parse->rtable, top_parentRTindex,
> + partitioned_rels, &all_part_cols);
>
> Two more spaces needed on the 2nd line.
>
>
>
> +void
> +pull_child_partition_columns(Bitmapset **bitmapset,
> + Relation rel,
> + Relation parent)
>
> Nitpick: don't we normally list the output argument(s) at the end? Also,
> "bitmapset" could be renamed to something that conveys what it contains?
>
> + if (partattno != 0)
> + child_keycols =
> + bms_add_member(child_keycols,
> + partattno -
> FirstLowInvalidHeapAttributeNumber);
> + }
> + foreach(lc, partexprs)
> + {
>
> Elsewhere (in quite a few places), we don't iterate over partexprs
> separately like this, although I'm not saying it is bad, just different
> from other places.
>
> get_all_partition_cols() seems to go over the rtable as many times as
> there are partitioned tables in the tree. Is there a way to do this work
> somewhere else? Maybe when the partitioned_rels list is built in the
> first place. But that would require us to make changes to extract
> partition columns in some place (prepunion.c) where it's hard to justify
> why it's being done there at all.
>
>
> + * If all_part_cols_p is non-NULL, *all_part_cols_p is set to a bitmapset
> + * of all partitioning columns used by the partitioned table or any
> + * descendent.
> + *
>
> Dead comment? Aha, so here's where all_part_cols was being set before...
>
> + TupleTableSlot *mt_rootpartition_tuple_slot;
>
> I guess I was complaining about this field where you call root a
> partition. Maybe, mt_root_tuple_slot would suffice.

Will get back with the above comments (TODO)

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment	Content-Type	Size
update-partition-key_v21.patch	application/octet-stream	111.8 KB

In response to

Re: UPDATE of partition key at 2017-10-16 02:58:30 from Amit Langote

Responses

Re: UPDATE of partition key at 2017-10-25 06:10:00 from Amit Khandekar
Re: UPDATE of partition key at 2017-11-02 07:10:44 from Amit Langote

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tom Lane	2017-10-23 15:22:47	Re: [sqlsmith] Failed assertion in adjust_appendrel_attrs_mutator
Previous Message	Alvaro Herrera	2017-10-23 15:12:51	Re: Proposal: Local indexes for partitioned table