Quick Links

Re: [HACKERS] UPDATE of partition key

From:	Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
To:	David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [HACKERS] UPDATE of partition key
Date:	2017-11-16 14:50:24
Message-ID:	CAJ3gD9fX9fmHwxAL_t4a90OEop3hQ6fOfmBqkh5r7VOyzS-JNw@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Thanks David Rowley, Alvaro Herrera and Thomas Munro for stepping in
for the reviews !

In the attached patch v24, I have addressed Amit Langote's remaining
review points, and David Rowley's comments upto point #26.

Yet to address :
Robert's few suggestions.
All of Alvaro's comments.
David's points from #27 to #34.
Thomas's point about adding remaining test coverage on transition tables.

Below has the responses for both Amit's and David's comments, starting
with Amit's ....

===============

On 2 November 2017 at 12:40, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> On 2017/10/24 0:15, Amit Khandekar wrote:
>> On 16 October 2017 at 08:28, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>>>
>>> + (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup
>>> == NULL))))
>>>
>>> Is there some reason why a bitwise operator is used here?
>>
>> That exact condition means that the function is called for transition
>> capture for updated rows being moved to another partition. For this
>> scenario, either the oldtup or the newtup is NULL. I wanted to exactly
>> capture that condition there. I think the bitwise operator is more
>> user-friendly in emphasizing the point that it is indeed an "either a
>> or b, not both" condition.
>
> I see. In that case, since this patch adds the new condition, a note
> about it in the comment just above would be good, because the situation
> you describe here seems to arise only during update-tuple-routing, IIUC.

Done. Please check.

> + * 'update_rri' has the UPDATE per-subplan result rels. These are re-used
> + * instead of allocating new ones while generating the array of all leaf
> + * partition result rels.
>
> Instead of:
>
> "These are re-used instead of allocating new ones while generating the
> array of all leaf partition result rels."
>
> how about:
>
> "There is no need to allocate a new ResultRellInfo entry for leaf
> partitions for which one already exists in this array"

Ok. I have made it like this :

+ * 'update_rri' contains the UPDATE per-subplan result rels. For the
output param
+ * 'partitions', we don't allocate new ResultRelInfo objects for
+ * leaf partitions for which they are already available
in 'update_rri'.

>
>>> ISTM, ConvertPartitionTupleSlot has a very specific job and a bit complex
>>> interface. I guess it could simply have the following interface:
>>>
>>> static HeapTuple ConvertPartitionTuple(ModifyTabelState *mtstate,
>>> HeapTuple tuple, bool is_update);
>>>
>>> And figure out, based on the value of is_update, which map to use and
>>> which slot to set *p_new_slot to (what is now "new_slot" argument).
>>> You're getting mtstate here anyway, which contains all the information you
>>> need here. It seems better to make that (selecting which map and which
>>> slot) part of the function's implementation if we're having this function
>>> at all, imho. Maybe I'm missing some details there, but my point still
>>> remains that we should try to put more logic in that function instead of
>>> it just do the mechanical tuple conversion.
>>
>> I tried to see how the interface would look if we do that way. Here is
>> how the code looks :
>>
>> static TupleTableSlot *
>> ConvertPartitionTupleSlot(ModifyTableState *mtstate,
>> bool for_update_tuple_routing,
>> int map_index,
>> HeapTuple *tuple,
>> TupleTableSlot *slot)
>> {
>> TupleConversionMap *map;
>> TupleTableSlot *new_slot;
>>
>> if (for_update_tuple_routing)
>> {
>> map = mtstate->mt_persubplan_childparent_maps[map_index];
>> new_slot = mtstate->mt_rootpartition_tuple_slot;
>> }
>> else
>> {
>> map = mtstate->mt_perleaf_parentchild_maps[map_index];
>> new_slot = mtstate->mt_partition_tuple_slot;
>> }
>>
>> if (!map)
>> return slot;
>>
>> *tuple = do_convert_tuple(*tuple, map);
>>
>> /*
>> * Change the partition tuple slot descriptor, as per converted tuple.
>> */
>> ExecSetSlotDescriptor(new_slot, map->outdesc);
>> ExecStoreTuple(*tuple, new_slot, InvalidBuffer, true);
>>
>> return new_slot;
>> }
>>
>> It looks like the interface does not much simplify, and above that, we
>> have more number of lines in that function. Also, the caller anyway
>> has to be aware whether the map_index is the index into the leaf
>> partitions or the update subplans. So it is not like the caller does
>> not have to be aware about whether the mapping should be
>> mt_persubplan_childparent_maps or mt_perleaf_parentchild_maps.
>
> Hmm, I think we should try to make it so that the caller doesn't have to
> be aware of that. And by caller I guess you mean ExecInsert(), which
> should not be a place, IMHO, where to try to introduce a lot of new logic
> specific to update tuple routing.

I think, for ExecInsert() since we have already given the job of
routing the tuple from root partitioned table to a partition, it makes
sense to give the function the additional job of routing the tuple
from any partition to any partition. ExecInsert() can be looked at as
doing this job : "insert a tuple into the right partition; the
original tuple can belong to any partition"

> With that, now there are no persubplan and perleaf arrays for ExecInsert()
> to pick from to select a map to pass to ConvertPartitionTupleSlot(), or
> maybe even no need for the separate function. The tuple-routing code
> block in ExecInsert would look like below (writing resultRelInfo as just Rel):
>
> rootRel = (mtstate->rootRel != NULL) ? mtstate->rootRel : Rel
>
> if (rootRel != Rel) /* update tuple-routing active */
> {
> int subplan_off = Rel - mtstate->Rel[0];
> int leaf_off = mtstate->mt_subplan_partition_offsets[subplan_off];
>
> if (mt_transition_tupconv_maps[leaf_off])
> {
> /*
> * Convert to root format using
> * mt_transition_tupconv_maps[leaf_off]
> */
>
> slot = mt_root_tuple_slot; /* for tuple-routing */
>
> /* Store the converted tuple into slot */
> }
> }
>
> /* Existing tuple-routing flow follows */
> new_leaf = ExecFindPartition(rootRel, slot, ...)
>
> if (mtstate->transition_capture)
> {
> transition_capture_map = mt_transition_tupconv_maps[new_leaf]
> }
>
> if (mt_partition_tupconv_maps[new_leaf])
> {
> /*
> * Convert to leaf format using mt_partition_tupconv_maps[new_leaf]
> */
>
> slot = mt_partition_tuple_slot;
>
> /* Store the converted tuple into slot */
> }
>

After doing the changes for the int[] array map in the previous patch
version, I still feel that ConvertPartitionTupleSlot() should be
retained. We save some repeated lines of code saved.

>> On HEAD, the "parent Plan" refers to
>> mtstate->mt_plans[0]. Now in the patch, for the parent node in
>> ExecInitQual(), mtstate->ps is passed rather than mt_plans[0]. So the
>> parent plan refers to this mtstate node.
>
> Hmm, I'm not really sure if doing that (passing mtstate->ps) would be
> accurate. In the update tuple routing case, it seems that it's better to
> pass the correct parent PlanState pointer to ExecInitQual(), that is, one
> corresponding to the partition's sub-plan. At least I get that feeling by
> looking at how parent is used downstream to that ExecInitQual() call, but
> there *may* not be anything to worry about there after all. I'm unsure.
>
>> BTW, the reason I had changed the parent node to mtstate->ps is :
>> Other places in that code use mtstate->ps while initializing
>> expressions :
>>
>> /*
>> * Build a projection for each result rel.
>> */
>> resultRelInfo->ri_projectReturning =
>> ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
>> resultRelInfo->ri_RelationDesc->rd_att);
>>
>> ...........
>>
>> /* build DO UPDATE WHERE clause expression */
>> if (node->onConflictWhere)
>> {
>> ExprState *qualexpr;
>>
>> qualexpr = ExecInitQual((List *) node->onConflictWhere,
>> &mtstate->ps);
>> ....
>> }
>>
>> I think wherever we initialize expressions belonging to a plan, we
>> should use that plan as the parent. WithCheckOptions are fields of
>> ModifyTableState.
>
> You may be right, but I see for WithCheckOptions initialization
> specifically that the non-tuple-routing code passes the actual sub-plan
> when initializing the WCO for a given result rel.

Yes that's true. The problem with WithCheckOptions for newly allocated
partition result rels is : we can't use a subplan for the parent
parameter because there is no subplan for it. But I will still think
on it a bit more (TODO).

>
>>> Comments on the optimizer changes:
>>>
>>> +get_all_partition_cols(List *rtables,
>>>
>>> Did you mean rtable?
>>
>> I did mean rtables. It's a list of rtables.
>
> It's not, AFAIK. rtable (range table) is a list of range table entries,
> which is also what seems to get passed to get_all_partition_cols for that
> argument (root->parse->rtable, which is not a list of lists).
>
> Moreover, there are no existing instances of this naming within the
> planner other than those that this patch introduces:
>
> $ grep rtables src/backend/optimizer/
> planner.c:114: static void get_all_partition_cols(List *rtables,
> planner.c:1063: get_all_partition_cols(List *rtables,
> planner.c:1069: Oid root_relid = getrelid(root_rti, rtables);
> planner.c:1078: Oid relid = getrelid(rti, rtables);
>
> OTOH, dependency.c does have rtables, but it's actually a list of range
> tables. For example:
>
> dependency.c:1360: context.rtables = list_make1(rtable);

Yes, Ok. To be consistent with naming convention at multiple places, I
have changed it to rtable.

>
>>> + if (partattno != 0)
>>> + child_keycols =
>>> + bms_add_member(child_keycols,
>>> + partattno -
>>> FirstLowInvalidHeapAttributeNumber);
>>> + }
>>> + foreach(lc, partexprs)
>>> + {
>>>
>>> Elsewhere (in quite a few places), we don't iterate over partexprs
>>> separately like this, although I'm not saying it is bad, just different
>>> from other places.
>>
>> I think you are suggesting we do it like how it's done in
>> is_partition_attr(). Can you please let me know other places we do
>> this same way ? I couldn't find.
>
> OK, not as many as I thought there would be, but there are following
> beside is_partition_attrs():
>
> partition.c: get_range_nulltest()
> partition.c: get_qual_for_range()
> relcache.c: RelationBuildPartitionKey()
>

Ok, I think I will first address Robert's suggestion of re-using
is_partition_attrs() for pull_child_partition_columns(). If I do that,
this discussion won't be applicable, so I am deferring this one.
(TODO)

=============

Below are my responses to David's comments upto point #26 :

On 13 November 2017 at 18:25, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com> wrote:
> On 10 November 2017 at 16:42, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote:
> [ update-partition-key_v23.patch ]
>
> Hi Amit,
>
> Thanks for working on this. I'm looking forward to seeing this go in.
>
> So... I've signed myself up to review the patch, and I've just had a
> look at it, (after first reading this entire email thread!).

Thanks a lot for your extensive review.

>
> Overall the patch looks like it's in quite a good shape.

Nice to hear that.

> I think I do agree with Robert about the UPDATE anomaly that's been discussed.
> I don't think we're painting ourselves into any corner by not having
> this working correctly right away. Anyone who's using some trigger
> workarounds for the current lack of support for updating the partition
> key is already going to have the same issues, so at least this will
> save them some troubles implementing triggers and give them much
> better performance.

I believe you are referring to the concurrency anomaly. Yes I agree on
that. By the way, (you may be already aware), there is a separate mail
thread going on to address this anamoly, so that we don't silently
proceed with the UPDATE without error :

https://www.postgresql.org/message-id/CAAJ_b95PkwojoYfz0bzXU8OokcTVGzN6vYGCNVUukeUDrnF3dw%40mail.gmail.com

> 1. Closing command tags in docs should not be abbreviated
>
> triggers are concerned, <literal>AFTER</> <command>DELETE</command> and
>
> This changed in c29c5789. I think Peter will be happy if you don't
> abbreviate the closing tags.

Added the tag. I had done most of the corrections after I rebased over
this commit, but I think I missed some of those with <literal> tag.

>
> 2. "about to do" would read better as "about to perform"
>
> concurrent session, and it is about to do an <command>UPDATE</command>
>
> I think this paragraph could be more clear if we identified the
> sessions with a number.
>
> Perhaps:
> Suppose, session 1 is performing an <command>UPDATE</command> on a
> partition key, meanwhile, session 2 tries to perform an <command>UPDATE
> </command> or <command>DELETE</command> operation on the same row.
> Session 2 can silently miss the row due to session 1's activity. In
> such a case, session 2's <command>UPDATE</command>/<command>DELETE
> </command>, being unaware of the row's movement, interprets this that the
> row has just been deleted, so there is nothing to be done for this row.
> Whereas, in the usual case where the table is not partitioned, or where
> there is no row movement, the second session would have identified the
> newly updated row and carried <command>UPDATE</command>/<command>DELETE
> </command> on this new row version.

Done like above, with slight changes.

>
>
> 3. Integer width. get_partition_natts returns int but we assign to int16.
>
> int16 partnatts = get_partition_natts(key);
>
> Confusingly get_partition_col_attnum() returns int16 instead of AttrNumber
> but that's existingly not correct.
>
> 4. The following code could just pull_varattnos(partexprs, 1, &child_keycols);
>
> foreach(lc, partexprs)
> {
> Node *expr = (Node *) lfirst(lc);
>
> pull_varattnos(expr, 1, &child_keycols);
> }

I will defer this till I address Robert's request to try and see if we
can have a common code for pull_child_partition_columns() and
is_partition_attr(). (TODO)

>
> 5. Triggers. Do we need a new "TG_" tag to allow trigger functions to
> do something
> special when the DELETE/INSERT is a partition move? I have audit
> tables in mind here
> it may appear as though a user performed a DELETE when they actually
> performed an UPDATE
> giving visibility of this to the trigger function will allow the
> application to work around this.

I feel it's too early to add a user-visible variable for such purpose.
Currently we don't support triggers on partitioned tables, and so a
user who wants to have a common trigger for a partition subtree has no
choice but to install the same trigger on all the leaf partitions
under it. And so we have to live with a not-very-obvious behaviour of
firing triggers even for the delete/insert part of the update row
movement.

>
> 6. change "row" to "a row" and "old" to "the old"
>
> * depending on whether the event is for row being deleted from old
>
> But to be honest, I'm having trouble parsing the comment. I think it
> would be better to
> say explicitly when the row will be NULL rather than "depending on
> whether the event"

I have put it this way now :

* For INSERT events newtup should be non-NULL, for DELETE events
* oldtup should be non-NULL, whereas for UPDATE events normally both
* oldtup and newtup are non-NULL. But for UPDATE event fired for
* capturing transition tuples during UPDATE partition-key row
* movement, oldtup is NULL when the event is for row being inserted,
* whereas newtup is NULL when the event is for row being deleted.

>
> 7. I'm confused with how this change came about. If the old comment
> was correct here then the comment you're referring to here should
> remain in ExecPartitionCheck(), but you're saying it's in
> ExecConstraints().
>
> /* See the comments in ExecConstraints. */
>
> If the comment really is in ExecConstraints(), then you might want to
> give an overview of what you mean, then reference ExecConstraints() if
> more details are required.

I have put it this way :
* Need to first convert the tuple to the root partitioned table's row
* type. For details, check similar comments in ExecConstraints().

Basically, the comment to be referred in ExecConstraints() is this :
* If the tuple has been routed, it's been converted to the
* partition's rowtype, which might differ from the root
* table's. We must convert it back to the root table's
* rowtype so that val_desc shown error message matches the
* input tuple.

>
> 8. I'm having trouble parsing this comment:
>
> * 'update_rri' has the UPDATE per-subplan result rels.
>
> I think "has" should be "contains" ?

Ok, changed it to 'contains'.

>
> 9. Also, this should likely be reworded:
>
> * 'num_update_rri' : number of UPDATE per-subplan result rels. For INSERT,
> * this is 0.
>
> 'num_update_rri' number of elements in 'update_rri' array or zero for INSERT.

Done.

>
> 10. There should be no space before the '?'
>
> /* Is this leaf partition present in the update resultrel ? */

Done.

>
> 11. I'm struggling to understand this comment:
>
> * This is required when converting tuple as per root
> * partition tuple descriptor.
>
> "tuple" should probably be "the tuple", but not quite sure what you
> mean by "as per root".
>
> I may have misunderstood, but maybe it should read:
>
> * This is required when we convert the partition's tuple to
> * be compatible with the partitioned table's tuple descriptor.

ri_PartitionRoot is set to NULL while creating the result rels for
each of the UPDATE subplans; and it is required to be set to the root
table for leaf partitions created for tuple routing so that the error
message displays the row in root tuple descriptor. Because we re-use
the same result rels for the per-partition array, we need to set it
for them here.

I have reworded the comment this way :

* This is required when we convert the partition's tuple to be
* compatible with the root partitioned table's tuple
* descriptor. When generating the per-subplan UPDATE result
* rels, this was not set.

Let me know if this is clear enough.

>
> 12. I think "as well" would be better written as "either".
>
> * If we didn't open the partition rel, it means we haven't
> * initialized the result rel as well.

Done.

>
> 13. I'm unsure what is meant by the following comment:
>
> * Verify result relation is a valid target for insert operation. Even
> * for updates, we are doing this for tuple-routing, so again, we need
> * to check the validity for insert operation.
>
> I'm not quite sure where UPDATE comes in here as we're only checking for INSERT?

Here, "Even for update" means "Even when
ExecSetupPartitionTupleRouting() is called for an UPDATE operation".

>
> 14. Use of underscores instead of camelCase.
>
> COPY_SCALAR_FIELD(part_cols_updated);

>
> I know you're not the first one to break this as "partitioned_rels"
> does not follow it either, but that's probably not a good enough
> reason to break away from camelCase any further.
>
> I'd suggest "partColsUpdated". But after a re-think, maybe cols is
> incorrect. All columns are partitioned, it's the key columns that we
> care about, so how about "partKeyUpdate"

Sure. I have used partKeyUpdated as against partKeyUpdate.

>
> 15. Are you sure that you mean "root" here?
>
> * All the child partition attribute numbers are converted to the root
> * partitioned table.
>
> Surely this is just the target relation. "parent" maybe? A
> sub-partitioned table might be the target of an UPDATE too.

Here the root means the root of the partition subtree, which is also
the UPDATE target relation. I think in other places we call it the
root even though it may also have ancestors. It is the root of the
subtree in question. This is similar to how we have named the
ModifyTableState->rootResultRelInfo field.

Note that Robert has requested to collect the partition cols at some
other place where we have already the table open. So this function
itself may change.

>
> 15. I see get_all_partition_cols() is just used once to check if
> parent_rte->updatedCols contains and partition keys.
>
> Would it not be better to reform that function and pass
> parent_rte->updatedCols in and abort as soon as you see a single
> match?
>
> Maybe the function could return bool and be named
> partitioned_key_overlaps(), that way your assignment in
> inheritance_planner() would just become:
>
> part_cols_updated = partitioned_key_overlaps(root->parse->rtable,
> top_parentRTindex, partitioned_rels, parent_rte->updatedCols);
>
> or something like that anyway.

I am going to think on all of this when I start checking if we can
have some common code for pull_child_partition_columns() and
is_partition_attr(). (TODO)

One thing to note is : Usually the user is not going to modify
partition cols. So typically we would need to scan through all the
partitioned tables to check if the partition key is modified. So to
make this scan more efficient, avoid the "bitmap_overlap" operation
for each of the partitioned tables separately, and instead, collect
them first from all partitioned tables, and then do a single overlap
operation. This way we make the normal updates a tiny bit fast, at the
expense of tiny-bit slower partition-key-updates because we don't
abort the scan as soon as we find the partition key updated.

>
> 16. Typo in comment
>
> * 'part_cols_updated' if any partitioning columns are being updated, either
> * from the named relation or a descendent partitione table.
>
> "partitione" should be "partitioned". Also, normally for bool
> parameters, we might word things like "True if ..." rather than just "if"
>
> You probably should follow camelCase I mentioned in 14 here too.

Done. Similar to the other bool param canSetTag, made it :
"'partColsUpdated' is true if any ..."

>
> 17. Comment needs a few changes:
>
> * ConvertPartitionTupleSlot -- convenience function for converting tuple and
> * storing it into a tuple slot provided through 'new_slot', which typically
> * should be one of the dedicated partition tuple slot. Passes the partition
> * tuple slot back into output param p_old_slot. If no mapping present, keeps
> * p_old_slot unchanged.
> *
> * Returns the converted tuple.
>
> There are a few typos here. For example, "tuple" should be "a tuple",
> but maybe the comment should just be worded like:
>
> * ConvertPartitionTupleSlot -- convenience function for tuple conversion
> * using 'map'. The tuple, if converted, is stored in 'new_slot' and
> * 'p_old_slot' is set to the original partition tuple slot. If map is NULL,
> * then the original tuple is returned unmodified, otherwise the converted
> * tuple is returned.

Modified, with some changes. p_old_slot name is a bit confusing. So I
have renamed it to p_my_slot.
Here is how it looks now :

* ConvertPartitionTupleSlot -- convenience function for tuple conversion using
* 'map'. The tuple, if converted, is stored in 'new_slot', and 'p_my_slot' is
* updated with the 'new_slot'. 'new_slot' typically should be one of the
* dedicated partition tuple slots. If map is NULL, keeps p_my_slot unchanged.
*
* Returns the converted tuple, unless map is NULL, in which case original
* tuple is returned unmodified.

>
> 18. Line goes over 80 chars.
>
> TransitionCaptureState *transition_capture = mtstate->mt_transition_capture;
>
> Better just to split the declaration and assignment.

Done.

>
> 19. Confusing comment:
>
> /*
> * If the original operation is UPDATE, the root partitioned table
> * needs to be fetched from mtstate->rootResultRelInfo.
> */
>
> It's not that clear here how you determine this is an UPDATE of a
> partitioned key.
>
> 20. This code looks convoluted:
>
> rootResultRelInfo = (mtstate->rootResultRelInfo ?
> mtstate->rootResultRelInfo : resultRelInfo);
>
> /*
> * If the resultRelInfo is not the root partitioned table (which
> * happens for UPDATE), we should convert the tuple into root's tuple
> * descriptor, since ExecFindPartition() starts the search from root.
> * The tuple conversion map list is in the order of
> * mtstate->resultRelInfo[], so to retrieve the one for this resultRel,
> * we need to know the position of the resultRel in
> * mtstate->resultRelInfo[].
> */
> if (rootResultRelInfo != resultRelInfo)
> {
>
> rootResultRelInfo is assigned via a ternary expression which makes the
> subsequent if test seem a little strange.
>
> Would it not be better to test:
>
> if (mtstate->rootResultRelInfo)
> {
> rootResultRelInfo = mtstate->rootResultRelInfo
> ... other stuff ...
> }
> else
> rootResultRelInfo = resultRelInfo;
>
> Then above the if test you can explain that rootResultRelInfo is only
> set during UPDATE of partition keys, as per #19.

Giving more thought on this, I think to avoid confusion to the reader,
we better have an explicit (operation == CMD_UPDATE) condition, and in
that block, assert that mtstate->rootResultRelInfo is non-NULL. I have
accordingly shuffled the if conditions. I think this is simple and
clear. Please check.

>
> 21. How come you renamed mt_partition_tupconv_maps[] to
> mt_parentchild_tupconv_maps[]?

mt_transition_tupconv_maps must be renamed to a more general map name
because it is not only used for transition capture but also for update
tuple routing. And we have mt_partition_tupconv_maps which is already
a general name. So to distinguish between the two tupconv maps, I
prepended "parent-child" or "child-parent" to "tupconv_maps".

>
> 22. Comment in ExecInsert() could be worded better.
>
> /*
> * In case this is part of update tuple routing, put this row into the
> * transition NEW TABLE if we are capturing transition tables. We need to
> * do this separately for DELETE and INSERT because they happen on
> * different tables.
> */
>
> /*
> * This INSERT may be the result of a partition-key-UPDATE. If so,
> * and we're required to capture transition tables then we'd better
> * record this as a statement level UPDATE on the target relation.
> * We're not interested in the statement level DELETE or INSERT as
> * these occur on the individual partitions, none of which are the
> * target of this the UPDATE statement.
> */
>
> A similar comment could use a similar improvement in ExecDelete()

I want to emphasize the fact that we need to do the OLD and NEW row
separately for DELETE and INSERT. And also, I think we need not
mention about statement triggers, though the transition table capture
with partitions currently is supported only for statement triggers. We
should only worry about capturing the row if
mtstate->mt_transition_capture != NULL, without having to know whether
it is for statement trigger or not.

Below is how the comment looks now after I did some changes as per
your suggestion about wording :

* If this INSERT is part of a partition-key-UPDATE and we are capturing
* transition tables, put this row into the transition NEW TABLE.
* (Similarly we need to add the deleted row in OLD TABLE). We need to do
* this separately for DELETE and INSERT because they happen on different
* tables.

>
> 23. Line is longer than 80 chars.
>
> TransitionCaptureState *transition_capture = mtstate->mt_transition_capture;

Done.

>
> 24. I know from reading the thread this name has changed before, but I
> think delete_skipped seems like the wrong name for this variable in:
>
> if (delete_skipped)
> *delete_skipped = true;
>
> Skipped is the wrong word here as that indicates like we had some sort
> of choice and that we decided not to. However, that's not the case
> when the tuple was concurrently deleted. Would it not be better to
> call it "tuple_deleted" or even "success" and reverse the logic? It's
> just a bit confusing that you're setting this to skipped before
> anything happens. It would be nicer if there was a better way to do
> this whole thing as it's a bit of a wart in the code. I understand why
> the code exists though.

I think "success" sounds like : if it is false, ExecDelete has failed.
So I have chosen "tuple_deleted". "tuple_actually_deleted" might sound
still better, but it is too long.

> Also, I wonder if it's better to always pass a boolean here to save
> having to test for NULL before setting it, that way you might consider
> putting the success = false just before the return NULL, then do
> success = true after the tuple is gone.
> Failing that, putting: something like: success = false; /* not yet! */
> where you're doing the if (deleted_skipped) test is might also be
> better.

I didn't really understand this.

>
> 25. Comment "we should" should be "we must".
>
> /*
> * For some reason if DELETE didn't happen (for e.g. trigger
> * prevented it, or it was already deleted by self, or it was
> * concurrently deleted by another transaction), then we should
> * skip INSERT as well, otherwise, there will be effectively one
> * new row inserted.
>
> Maybe just:
> /* If the DELETE operation was unsuccessful, then we must not
> * perform the INSERT into the new partition.

I think we better mention some scenarios of why this can happen ,
otherwise its confusing to the reader why the delete can't happen, or
why we shouldn't error out in that case.

>
> "for e.g." is not really correct in English. "For example, ..." or
> just "e.g. ..." is correct. If you de-abbreviate the e.g. then you've
> written "For exempli gratia", which translates to "For for example".

I see. Good to know that. Done.

>
> 26. You're not really explaining what's going on here:
>
> if (mtstate->mt_transition_capture)
> saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
>
> You have a comment later to say you're about to "Revert back to the
> transition capture map", but I missed the part that explained about
> modifying it in the first place.

I have now added main comments while saving the map, and I refer to
this comment while reverting back the map.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment	Content-Type	Size
update-partition-key_v24.patch	application/octet-stream	111.8 KB

In response to

Re: UPDATE of partition key at 2017-11-13 12:55:45 from David Rowley

Responses

Re: [HACKERS] UPDATE of partition key at 2017-11-22 10:11:19 from Amit Langote

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Ildar Musin	2017-11-16 15:09:37	Re: [HACKERS] Custom compression methods
Previous Message	amul sul	2017-11-16 14:37:07	Re: [COMMITTERS] pgsql: Add hash partitioning.