Re: UPDATE of partition key

From: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: UPDATE of partition key
Date: 2017-09-20 15:57:10
Message-ID: CAJ3gD9cCvDgor3gVXszzxni+hyhbX7bxkhQu=sN7Sy9fKNxBQQ@mail.gmail.com
Lists: pgsql-hackers

On 20 September 2017 at 00:06, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Sep 15, 2017 at 7:25 AM, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote:
>> [ new patch ]
>
> This already fails to apply again. In general, I think it would be a
> good idea to break this up into a patch series rather than have it as
> a single patch. That would allow some bits to be applied earlier.
> The main patch will probably still be pretty big, but at least we can
> make things a little easier by getting some of the cleanup out of the
> way first. Specific suggestions on what to break out below.
>
> If the changes to rewriteManip.c are a marginal efficiency hack and
> nothing more, then let's commit this part separately before the main
> patch. If they're necessary for correctness, then please add a
> comment explaining why they are necessary.

Ok. Yes, I just wanted to avoid stacking two ConvertRowtypeExpr nodes
one on top of the other, but that was not causing any correctness
issue. Will extract these changes into a separate patch.

>
> There appears to be no reason why the definitions of
> GetInsertedColumns() and GetUpdatedColumns() need to be moved to a
> header file as a result of this patch. GetUpdatedColumns() was
> previously defined in trigger.c and execMain.c and, post-patch, is
> still called from only those files. GetInsertedColumns() was, and
> remains, called only from execMain.c. If this were needed I'd suggest
> doing it as a preparatory patch before the main patch, but it seems we
> don't need it at all.

In earlier versions of the patch, these functions were also used in
nodeModifyTable.c. Now that those calls are no longer there, I will
revert the changes that moved the definitions into the header file.

>
> If I understand correctly, the reason for changing mt_partitions from
> ResultRelInfo * to ResultRelInfo ** is that, currently, all of the
> ResultRelInfos for a partitioning hierarchy are allocated as a single
> chunk, but we can't do that and also reuse the ResultRelInfos created
> during InitPlan. I suggest that we do this as a preparatory patch.

Ok, will prepare a separate patch. Do you mean that patch should also
include the changes I did in ExecSetupPartitionTupleRouting() that
re-use the ResultRelInfo structures of the per-subplan update result
rels?

> Someone could argue that this is going the wrong way and that we ought
> to instead make InitPlan() create all of the necessary
> ResultRelInfos, but it seems to me that eventually we probably want to
> allow setting up ResultRelInfos on the fly for only those partitions
> for which we end up needing them. The code already has some provision
> for creating ResultRelInfos on the fly - see ExecGetTriggerResultRel.
> I don't think it's this patch's job to try to apply that kind of thing
> to tuple routing, but it seems like in the long run if we're inserting
> 1 tuple into a table with 1000 partitions, or performing 1 update that
> touches the partition key, it would be best not to create
> ResultRelInfos for all 1000 partitions just for fun.

Yes, that makes sense.

> But this sort of
> thing seems much easier if mt_partitions is ResultRelInfo ** rather
> than ResultRelInfo *, so I think what you have is going in the right
> direction.

Ok.

>
> + * mtstate->resultRelInfo[]. Note: We assume that if the resultRelInfo
> + * does not belong to subplans, then it already matches the root tuple
> + * descriptor; although there is no such known scenario where this
> + * could happen.
> + */
> + if (rootResultRelInfo != resultRelInfo &&
> + mtstate->mt_persubplan_childparent_maps != NULL &&
> + resultRelInfo >= mtstate->resultRelInfo &&
> + resultRelInfo <= mtstate->resultRelInfo + mtstate->mt_nplans - 1)
> + {
> + int map_index = resultRelInfo - mtstate->resultRelInfo;
>
> I think you should Assert() that it doesn't happen instead of assuming
> that it doesn't happen. IOW, remove the last two branches of the
> if-condition, and then add an Assert() that map_index is sane.

Ok.
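
Something like the following, I suppose -- keep the first two
conditions and assert that the index is in range (just a sketch; the
exact shape in the patch may differ slightly):

    if (rootResultRelInfo != resultRelInfo &&
        mtstate->mt_persubplan_childparent_maps != NULL)
    {
        int     map_index = resultRelInfo - mtstate->resultRelInfo;

        Assert(map_index >= 0 && map_index < mtstate->mt_nplans);
        ...
    }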

>
> It is not clear to me why we need both mt_perleaf_childparent_maps and
> mt_persubplan_childparent_maps.

mt_perleaf_childparent_maps:
This is used for converting transition-captured
inserted/modified/deleted tuples from a leaf partition to the root
partition, because all the captured rows need to be in the root
partition's attribute order. This map array is used only for tuples
that are routed from the root to a leaf partition during INSERT, or
from one leaf partition to another during UPDATE row movement. For
both of these operations we need per-leaf maps, because during tuple
conversion the source relation is among mtstate->mt_partitions.
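
To illustrate (just a sketch with made-up variable names, not the
exact patch code), each per-leaf element is a child-to-root conversion
map, and a captured tuple routed into leaf partition i gets converted
through it before being appended to the transition tuplestore:

    /* one child-to-root map per leaf partition; NULL if descriptors match */
    perleaf_maps[i] = convert_tuples_by_name(RelationGetDescr(leaf_rel),
                                             RelationGetDescr(root_rel),
                          gettext_noop("could not convert row type"));
    ...
    /* while capturing a tuple that was routed into leaf partition i */
    if (perleaf_maps[i] != NULL)
        tuple = do_convert_tuple(tuple, perleaf_maps[i]);
    tuplestore_puttuple(tuplestore, tuple);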

mt_persubplan_childparent_maps:
This is used at two places:

1. After ExecUpdate() updates a row of a per-subplan update result
rel, we need to capture the tuple, so again we need to convert it to
the root partition. Here, the source table is a per-subplan update
result rel, so we need a per-subplan conversion map array. When the
UPDATE moves on from one update result rel to the next,
node->mt_transition_capture->tcs_map is pointed at the corresponding
element of the mt_persubplan_childparent_maps array:

ExecModifyTable()
{
    ....
    node->mt_transition_capture->tcs_map =
        node->mt_persubplan_childparent_maps[node->mt_whichplan];
    ....
}

2. In ExecInsert(), if it is part of update tuple routing, we need to
convert the tuple from the update result rel to the root partition. So
it re-uses this same conversion map.

Now, instead of these two maps having separate allocations, I have
arranged for the per-leaf map array to re-use the mapping allocations
made for the per-subplan array elements, similar to how we re-use the
ResultRelInfos. But the arrays themselves still need to be separate.
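
Roughly (sketch only; subplan_index_for_leaf() is a made-up helper
standing for whatever lookup the patch actually does), the per-leaf
array just points at the per-subplan map wherever the leaf partition
is also an update result rel, and builds a fresh map otherwise:

    for (i = 0; i < num_leaf_parts; i++)
    {
        int     j = subplan_index_for_leaf(i);  /* -1 if not an update result rel */

        if (j >= 0)
            perleaf_maps[i] = persubplan_maps[j];   /* share the allocation */
        else
            perleaf_maps[i] = convert_tuples_by_name(...);
    }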

>
> + * Note: if the UPDATE is converted into a DELETE+INSERT as part of
> + * update-partition-key operation, then this function is also called
> + * separately for DELETE and INSERT to capture transition table rows.
> + * In such case, either old tuple or new tuple can be NULL.
>
> That seems pretty strange. I don't quite see how that's going to work
> correctly. I'm skeptical about the idea that the old tuple capture
> and new tuple capture can safely happen at different times.

Actually the tuple capture involves just adding the tuple into the
correct tuplestore for a particular event; no trigger event is queued
for tuple capture. Calling ExecARUpdateTriggers() with either newtuple
NULL or tupleid invalid makes sure that it does not do anything other
than the transition capture:

@@ -5306,7 +5322,8 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
     /* If transition tables are the only reason we're here, return. */
     if (trigdesc == NULL ||
         (event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
         (event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
-        (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
+        (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
+        (event == TRIGGER_EVENT_UPDATE && (oldtup == NULL || newtup == NULL)))
         return;

Even if we imagine a single place or a single function that we could
call to do the OLD and NEW row capture, the end result is still going
to be the same: the OLD row would go into
mtstate->mt_transition_capture->tcs_old_tuplestore, and the NEW row
would end up in mtstate->mt_transition_capture->tcs_update_tuplestore.
Note that these tuplestores are shared by all the partitions of the
partition tree.

(Actually I am still rebasing my patch over the recent changes where
tcs_update_tuplestore no longer exists; instead we need to use
transition_capture->tcs_private->new_tuplestore.)

When we access the OLD and NEW tables from an UPDATE trigger, there
is no longer any correlation between which row of the OLD TABLE
corresponds to which row of the NEW TABLE for a given updated row. So
exactly when the OLD row and the NEW row get captured into their
respective tuplestores, and in which order, does not matter.

Whereas, for the usual per-ROW triggers, it is critical that the
trigger event has both the OLD and NEW rows together, since both need
to be accessible in the same trigger function.

Doing the OLD and NEW table row capture separately is essential
because the DELETE and INSERT happen on different tables, so we are
not even sure whether the insert is going to happen at all (a trigger
on the destination partition, if any, may skip it). If the insert is
skipped, we should not capture that tuple.
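
So, roughly, the two capture calls would look like this (sketch; the
exact call sites and arguments in the patch may differ slightly):

    /*
     * In ExecDelete(), when the row is being moved out of this partition:
     * only the old tuple is available, the new tuple is NULL.
     */
    ExecARUpdateTriggers(estate, resultRelInfo,
                         tupleid,       /* TID of the old row */
                         NULL,          /* fdw_trigtuple */
                         NULL,          /* new tuple not known here */
                         NIL,
                         mtstate->mt_transition_capture);

    /*
     * In ExecInsert(), after the row lands in the destination partition:
     * only the new tuple is available, tupleid is invalid.
     */
    ExecARUpdateTriggers(estate, resultRelInfo,
                         NULL,          /* no old TID in this partition */
                         NULL,
                         newtuple,
                         NIL,
                         mtstate->mt_transition_capture);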

>
> I wonder if we should have a reloption controlling whether
> update-tuple routing is enabled. I wonder how much more expensive it
> is to execute UPDATE root SET a = a + 1 WHERE a = 1 on a table with
> 1000 subpartitions with this patch than without, assuming the update
> succeeds in both cases.

You mean to check how much the patch slows down the existing updates
that involve no row movement, and then have a reloption so that the
logic causing that slowdown can be disabled?

>
> I also wonder how efficient this implementation is in general. For
> example, suppose you make a table with 1000 partitions each containing
> 10,000 tuples and update them all, and consider three scenarios: (1)
> partition key not updated but all tuples subject to non-HOT updates
> because the updated column is indexed, (2) partition key updated but
> no tuple movement required as a result, (3) partition key updated and
> all tuples move to a different partition. It would be useful to
> compare the times, and also to look at perf profiles and see if there
> are any obvious sources of inefficiency that can be squeezed out. It
> wouldn't surprise me if tuple movement is a bit slower than the other
> scenarios, but it would be nice to know how much slower and whether
> the bottlenecks are anything that we can easily fix.

Ok, yes, that would be helpful for removing any unnecessary slowness
the patch may have introduced; will do.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
