Re: [HACKERS] UPDATE of partition key

From: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Subject: Re: [HACKERS] UPDATE of partition key
Date: 2018-01-04 06:18:26
Message-ID: CAJ3gD9dAX5dZXbBAqrC_4+z=XU7B8NMrrmvOw=aTJMMHEM3LQA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Robert, for tracking purpose, below I have consolidated your review
items on which we are yet to conclude. Let me know if you have more
comments on the points which I made.

------------------
1. ExecUpdate() needs to revert back tcs_map value changed by ExecInsert()
------------------

>> + /*
>> + * UPDATEs set the transition capture map only when a new subplan
>> + * is chosen. But for INSERTs, it is set for each row. So after
>> + * INSERT, we need to revert back to the map created for UPDATE;
>> + * otherwise the next UPDATE will incorrectly use the one created
>> + * for INESRT. So first save the one created for UPDATE.
>> + */
>> + if (mtstate->mt_transition_capture)
>> + saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
>>
>> I wonder if there is some more elegant way to handle this problem.
>> Basically, the issue is that ExecInsert() is stomping on
>> mtstate->mt_transition_capture, and your solution is to save and
>> restore the value you want to have there. But maybe we could instead
>> find a way to get ExecInsert() not to stomp on that state in the first
>> place. It seems like the ON CONFLICT stuff handled that by adding a
>> second TransitionCaptureState pointer to ModifyTable, thus
>> mt_transition_capture and mt_oc_transition_capture. By that
>> precedent, we could add mt_utr_transition_capture or similar, and
>> maybe that's the way to go. It seems a bit unsatisfying, but so does
>> what you have now.
>
> In case of ON CONFLICT, if there are both INSERT and UPDATE statement
> triggers referencing transition tables, both of the triggers need to
> independently populate their own transition tables, and hence the need
> for two separate transition states : mt_transition_capture and
> mt_oc_transition_capture. But in case of update-tuple-routing, the
> INSERT statement trigger won't come into picture. So the same
> mt_transition_capture can serve the purpose of populating the
> transition table with OLD and NEW rows. So I think it would be too
> redundant, if not incorrect, to have a whole new transition state for
> update tuple routing.
>
> I will see if it turns out better to have two tcs_maps in
> TransitionCaptureState, one for update and one for insert. But this,
> on first look, does not look good.

Suppose TransitionCaptureState has separate maps, upd_del_tcs_maps and
insert_tcs_maps for UPDATE/DELETE and INSERT events respectively. So
upd_del_tcs_maps will be updated only after we start with the next
UPDATE subplan, whereas insert_tcs_maps will keep on getting updated
for each row. So in AfterTriggerSaveEvent(), upd_del_tcs_maps would be
used when the event is TRIGGER_EVENT_[UPDATE/DELETE], and
insert_tcs_maps will be used when event == TRIGGER_EVENT_INSERT. But
the issue is : even if the event is TRIGGER_EVENT_UPDATE, we don't
know whether this is caused by a normal update or as part of an insert
into new partition during partition-key-update. So blindly using
upd_del_tcs_maps is incorrect. If the event is caused by the later, we
should use insert_tcs_maps rather than upd_del_tcs_maps. But we do not
have the information in trigger.c as to what caused this event.

So, overall, it would not work, and even if we make it work by passing
or storing some more information somewhere, the
AfterTriggerSaveEvent() logic will become too complicated.

So I can't think of anything else, but to keep the way I did, i.e.
reverting back the tcs_map once insert finishes. We so a similar thing
for reverting back the estate->es_result_relation_info.

------------------
2. mt_childparent_tupconv_maps is indexed by subplan or partition leaf index.
------------------

> + * If per-leaf map is required and the map is already created, that map
> + * has to be per-leaf. If that map is per-subplan, we won't be able to
> + * access the maps leaf-partition-wise. But if the map is per-leaf, we
> + * will be able to access the maps subplan-wise using the
> + * subplan_partition_offsets map using function
> + * tupconv_map_for_subplan(). So if the callers might need to access
> + * the map both leaf-partition-wise and subplan-wise, they should make
> + * sure that the first time this function is called, it should be
> + * called with perleaf=true so that the map created is per-leaf, not
> + * per-subplan.
>
> This sounds complicated and fragile. It ends up meaning that
> mt_childparent_tupconv_maps is sometimes indexed by subplan number and
> sometimes by partition leaf index, which is extremely confusing and
> likely to lead to coding errors, either in this patch or in future
> ones.

Even if we always index the map by leaf partition, while accessing the
map the code still needs to be aware of whether the index number with
which we are accessing the map is the subplan number or leaf partition
number:

If the access is by subplan number, use subplan_partition_offsets to
convert to the leaf partition index. So the function
tupconv_map_for_subplan() is anyways necessary for accessing using
subplan index. Only thing that will change is :
tupconv_map_for_subplan() will not have to check if the the map is
indexed by leaf partition or not. But that complexity is hidden in
this function alone; the outside code need not worry about that.

If the access is by leaf partition number, I think you are worried
here that the map might have been incorrectly indexed by subplan, and
the code might access it partition-wise. Currently we access the map
by leaf-partition-index only when setting up
mtstate->mt_*transition_capture->tcs_map during inserts. At that
place, there is an Assert(mtstate->mt_is_tupconv_perpart == true). May
be, we can have another function tupconv_map_for_partition() rather
than directly accessing mt_childparent_tupconv_maps[], and have this
Assert() in that function. What do you say ?

I am more inclined towards avoiding an always-leaf-partition-indexed
map for additional reasons mentioned below ...

> Would it be reasonable to just always do this by partition leaf
> index, even if we don't strictly need that set of mappings?

If there are no transition tables in picture, we don't require
per-leaf child-parent conversion. So, this would mean that the tuple
conversion maps will be set up for all (say, 100) leaf partitions even
if there are only, say, a couple of update plans. I feel this would
unnecessarily increase the startup cost of update-partition-key
operation.

------------------
3. Renaming of mt_transition_tupconv_maps to mt_childparent_tupconv_maps
------------------

>
> Likewise, I'm not sure I get the point of mt_transition_tupconv_maps
> -> mt_childparent_tupconv_maps. That seems like it could similarly be
> left alone.

We need to change it's name because now this map is not only used for
transition capture, but also for update-tuple-routing. Does it look ok
for you if, for readability, we keep the childparent tag ? Or else, we
can just make it "mt_tupconv_maps", but "mt_childparent_tupconv_maps"
looks more informative.

-------------------
4. Explicit signaling for "we are only here for transition tables"
-------------------

>
> + /*
> + * If transition tables are the only reason we're here, return. As
> + * mentioned above, we can also be here during update tuple routing in
> + * presence of transition tables, in which case this function is called
> + * separately for oldtup and newtup, so either can be NULL, not both.
> + */
> if (trigdesc == NULL ||
> (event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
> (event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
> - (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
> + (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
> + (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))
>
> I guess this is correct, but it seems awfully fragile. Can't we have
> some more explicit signaling about whether we're only here for
> transition tables, rather than deducing it based on exactly one of
> oldtup and newtup being NULL?

I had given a thought on this earlier. I felt, even the pre-existing
conditions like "!trigdesc->trig_update_after_row" are all indirect
ways to determine that this function is called only to capture
transition tables, and thought that it may have been better to have
separate parameter transition_table_only.

But then decided that I can continue on similar lines and add another
such condition to indicate that we are only capturing update-routed
tuples.

Instead of adding another parameter to AfterTriggerSaveEvent(), I had
also considered another approach: Put the transition-tuples-capture
logic part of AfterTriggerSaveEvent() into a helper function
CaptureTransitionTables(). In ExecInsert() and ExecDelete(), instead
of calling ExecARUpdateTriggers(), call this function
CaptureTransitionTables(). I then dropped this idea and thought rather
to call ExecARUpdateTriggers() which neatly does the required checks
and other things like locking the old tuple via GetTupleForTrigger().
So if we go by CaptureTransitionTables(), we would need to do what
ExecARUpdateTriggers() does before calling CaptureTransitionTables().
This is doable. If you think this is worth doing so as to get rid of
the "(oldtup == NULL) ^ (newtup == NULL)" condition, we can do that.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fabien COELHO 2018-01-04 06:19:52 Re: [HACKERS] pgbench more operators & functions
Previous Message Tom Lane 2018-01-04 06:09:33 Re: pgsql: Add parallel-aware hash joins.