Re: partition routing layering in nodeModifyTable.c

From: Amit Langote <amitlangote09(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, amitdkhan(dot)pg(at)gmail(dot)com
Subject: Re: partition routing layering in nodeModifyTable.c
Date: 2019-07-18 05:24:29
Message-ID: CA+HiwqE24u1R+7ZtoF2KS4iwrfu9j9VfzjcKK=q4_jX9Wnu6Cg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Andres,

On Thu, Jul 18, 2019 at 10:09 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> 1) How come partition routing is done outside of ExecInsert()?
>
> case CMD_INSERT:
> /* Prepare for tuple routing if needed. */
> if (proute)
> slot = ExecPrepareTupleRouting(node, estate, proute,
> resultRelInfo, slot);
> slot = ExecInsert(node, slot, planSlot,
> estate, node->canSetTag);
> /* Revert ExecPrepareTupleRouting's state change. */
> if (proute)
> estate->es_result_relation_info = resultRelInfo;
> break;
>
> That already seems like a layering violation,

The decision to move partition routing out of ExecInsert() came about
when we encountered a bug [1] whereby ExecInsert() would fail to reset
estate->es_result_relation_info back to the root table if it had to
take an abnormal path out (early return), of which there are quite a
few instances. The first solution I came up with was to add a goto
label for the code to reset estate->es_result_relation_info and jump
to it from the various places that do an early return, which was
complained about as reducing readability. So, the solution we
eventually settled on in 6666ee49f was to perform ResultRelInfos
switching at a higher level.

> but it's made worse by
> ExecUpdate() having its partition handling solely inside - including another
> call to ExecInsert(), including the surrounding partition setup code.
>
> And even worse, after all that, ExecInsert() still contains partitioning
> code.

AFAIK, it's only to check the partition constraint when necessary.
Partition routing complexity is totally outside, but based on what you
write in point 4 below there's bit more...

> It seems to me that if we just moved the ExecPrepareTupleRouting() into
> ExecInsert(), we could remove the duplication.

I agree that there's duplication here. Given what I wrote above, I
can think of doing this: move all of ExecInsert()'s code into
ExecInsertInternal() and make the former instead look like this:

static TupleTableSlot *
ExecInsert(ModifyTableState *mtstate,
TupleTableSlot *slot,
TupleTableSlot *planSlot,
EState *estate,
bool canSetTag)
{
PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
ResultRelInfo *resultRelInfo = estate->es_result_relation_info;

/* Prepare for tuple routing if needed. */
if (proute)
slot = ExecPrepareTupleRouting(mtstate, estate, proute, resultRelInfo,
slot);

slot = ExecInsertInternal(mtstate, slot, planSlot, estate,
mtstate->canSetTag);

/* Revert ExecPrepareTupleRouting's state change. */
if (proute)
estate->es_result_relation_info = resultRelInfo;

return slot;
}

> 2) The contents of the
> /*
> * If a partition check failed, try to move the row into the right
> * partition.
> */
> if (partition_constraint_failed)
>
> block ought to be moved to a separate function (maybe
> ExecCrossPartitionUpdate or ExecMove). ExecUpdate() is already
> complicated enough without dealing with the partition move.

I tend to agree with this. Adding Amit Khandekar in case he wants to
chime in about this.

> 3) How come we reset estate->es_result_relation_info after partition
> routing, but not the mtstate wide changes by
> ExecPrepareTupleRouting()? Note that its comment says:
>
> * Caller must revert the estate changes after executing the insertion!
> * In mtstate, transition capture changes may also need to be reverted.
>
> ExecUpdate() contains
>
> /*
> * Updates set the transition capture map only when a new subplan
> * is chosen. But for inserts, it is set for each row. So after
> * INSERT, we need to revert back to the map created for UPDATE;
> * otherwise the next UPDATE will incorrectly use the one created
> * for INSERT. So first save the one created for UPDATE.
> */
> if (mtstate->mt_transition_capture)
> saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
>
> but as I read the code, that's not really true? It's
> ExecPrepareTupleRouting() that does so, and that's called directly in ExecUpdate().

Calling ExecPrepareTupleRouting() is considered a part of a given
INSERT operation, so anything it does is to facilitate the INSERT. In
this case, which map to assign to tcs_map can only be determined after
a partition is chosen and determining the partition (routing) is a job
of ExecPrepareTupleRouting(). Perhaps, we need to update the comment
here a bit.

> 4)
> /*
> * If this insert is the result of a partition key update that moved the
> * tuple to a new partition, put this row into the transition NEW TABLE,
> * if there is one. We need to do this separately for DELETE and INSERT
> * because they happen on different tables.
> */
> ar_insert_trig_tcs = mtstate->mt_transition_capture;
> if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
> && mtstate->mt_transition_capture->tcs_update_new_table)
> {
> ExecARUpdateTriggers(estate, resultRelInfo, NULL,
> NULL,
> slot,
> NULL,
> mtstate->mt_transition_capture);
>
> /*
> * We've already captured the NEW TABLE row, so make sure any AR
> * INSERT trigger fired below doesn't capture it again.
> */
> ar_insert_trig_tcs = NULL;
> }
>
> /* AFTER ROW INSERT Triggers */
> ExecARInsertTriggers(estate, resultRelInfo, slot, recheckIndexes,
> ar_insert_trig_tcs);
>
> Besides not using the just defined ar_insert_trig_tcs and instead
> repeatedly referring to mtstate->mt_transition_capture, wouldn't this be
> a easier to understand if the were an if/else, instead of resetting
> ar_insert_trig_tcs? If the block were
>
> /*
> * triggers behave differently depending on this being a delete as
> * part of a partion move, or a deletion proper.
> if (mtstate->operation == CMD_UPDATE)
> {
> /*
> * If this insert is the result of a partition key update that moved the
> * tuple to a new partition, put this row into the transition NEW TABLE,
> * if there is one. We need to do this separately for DELETE and INSERT
> * because they happen on different tables.
> */
> ExecARUpdateTriggers(estate, resultRelInfo, NULL,
> NULL,
> slot,
> NULL,
> mtstate->mt_transition_capture);
>
> /*
> * But we do want to fire plain per-row INSERT triggers on the
> * new table. By not passing in transition_capture we prevent
> * ....
> */
> ExecARInsertTriggers(estate, resultRelInfo, slot, recheckIndexes,
> NULL);
> }
> else
> {
> /* AFTER ROW INSERT Triggers */
> ExecARInsertTriggers(estate, resultRelInfo, slot, recheckIndexes,
> ar_insert_trig_tcs);
> }

Maybe you meant to use mtstate->mt_transition_capture instead of
ar_insert_trig_tcs in the else block. We don't need
ar_insert_trig_tcs at all.

> it seems like it'd be quite a bit clearer (although I do think the
> comments also need a fair bit of polishing independent of this proposed
> change).

Fwiw, I agree with your proposed restructuring, although I'd let Amit
Kh chime in as he'd be more familiar with this code. I wasn't aware
of this partitioning-related bit being present in ExecInsert().

Would you like me to write a patch for some or all items?

Thanks,
Amit

[1] https://www.postgresql.org/message-id/flat/0473bf5c-57b1-f1f7-3d58-455c2230bc5f%40lab.ntt.co.jp

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2019-07-18 05:29:09 Re: pg_receivewal documentation
Previous Message Tom Lane 2019-07-18 05:04:13 Re: Add client connection check during the execution of the query