Re: partition routing layering in nodeModifyTable.c

From: Andres Freund <andres(at)anarazel(dot)de>
To: Amit Langote <amitlangote09(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, amitdkhan(dot)pg(at)gmail(dot)com
Subject: Re: partition routing layering in nodeModifyTable.c
Date: 2019-07-19 16:52:41
Message-ID: 20190719165241.nblajgpf5ys2q5sh@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2019-07-19 17:52:20 +0900, Amit Langote wrote:
> Attached are two patches.

Awesome.

> The first one (0001) deals with reducing the core executor's reliance
> on es_result_relation_info to access the currently active result
> relation, in favor of receiving it from the caller as a function
> argument. So no piece of core code relies on it being correctly set
> anymore. It still needs to be set correctly for the third-party code
> such as FDWs.

I'm inclined to just remove it. There's not much code out there relying
on it, as far as I can tell. Most FDWs don't support the direct modify
API, and that's afaict the case where we one needs to use
es_result_relation_info?

In fact, I searched through alllFDWs listed on https://wiki.postgresql.org/wiki/Foreign_data_wrappers
that are on github and in first few categories (up and including to
"file wrappers"), and there was only one reference to
es_result_relation_info, and that just in comments in a test:
https://github.com/pgspider/griddb_fdw/search?utf8=%E2%9C%93&q=es_result_relation_info&type=
which I think was just copied from our source code.

IOW, we should just change the direct modify calls to get the relevant
ResultRelationInfo or something in that vein (perhaps just the relevant
RT index?).

pglogical also references it, but just because it creates its own
EState afaict.

> @@ -334,32 +335,50 @@ ExecComputeStoredGenerated(EState *estate, TupleTableSlot *slot)
> * ExecInsert
> *
> * For INSERT, we have to insert the tuple into the target relation
> - * and insert appropriate tuples into the index relations.
> + * (or partition thereof) and insert appropriate tuples into the index
> + * relations.
> *
> * Returns RETURNING result if any, otherwise NULL.
> + *
> + * This may change the currently active tuple conversion map in
> + * mtstate->mt_transition_capture, so the callers must take care to
> + * save the previous value to avoid losing track of it.
> * ----------------------------------------------------------------
> */
> static TupleTableSlot *
> ExecInsert(ModifyTableState *mtstate,
> + ResultRelInfo *resultRelInfo,
> TupleTableSlot *slot,
> TupleTableSlot *planSlot,
> EState *estate,
> bool canSetTag)
> {
> - ResultRelInfo *resultRelInfo;
> Relation resultRelationDesc;
> List *recheckIndexes = NIL;
> TupleTableSlot *result = NULL;
> TransitionCaptureState *ar_insert_trig_tcs;
> ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
> OnConflictAction onconflict = node->onConflictAction;
> + PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
> +
> + /*
> + * If the input result relation is a partitioned table, find the leaf
> + * partition to insert the tuple into.
> + */
> + if (proute)
> + {
> + ResultRelInfo *partRelInfo;
> +
> + slot = ExecPrepareTupleRouting(mtstate, estate, proute,
> + resultRelInfo, slot,
> + &partRelInfo);
> + resultRelInfo = partRelInfo;
> + /* Result relation has changed, so update EState reference too. */
> + estate->es_result_relation_info = resultRelInfo;
> + }

I think by removing es_result_relation entirely, this would look
cleaner.

> @@ -1271,18 +1274,18 @@ lreplace:;
> mtstate->mt_root_tuple_slot);
>
> /*
> - * Prepare for tuple routing, making it look like we're inserting
> - * into the root.
> + * ExecInsert() may scribble on mtstate->mt_transition_capture,
> + * so save the currently active map.
> */
> + if (mtstate->mt_transition_capture)
> + saved_tcs_map = mtstate->mt_transition_capture->tcs_map;

Wonder if we could remove the need for this somehow, it's still pretty
darn ugly. Thomas, perhaps you have some insights?

To me the need to modify these ModifyTable wide state on a per-subplan
and even per-partition basis indicates that the datastructures are in
the wrong place.

> @@ -2212,23 +2207,17 @@ ExecModifyTable(PlanState *pstate)
> switch (operation)
> {
> case CMD_INSERT:
> - /* Prepare for tuple routing if needed. */
> - if (proute)
> - slot = ExecPrepareTupleRouting(node, estate, proute,
> - resultRelInfo, slot);
> - slot = ExecInsert(node, slot, planSlot,
> + slot = ExecInsert(node, resultRelInfo, slot, planSlot,
> estate, node->canSetTag);
> - /* Revert ExecPrepareTupleRouting's state change. */
> - if (proute)
> - estate->es_result_relation_info = resultRelInfo;
> break;
> case CMD_UPDATE:
> - slot = ExecUpdate(node, tupleid, oldtuple, slot, planSlot,
> - &node->mt_epqstate, estate, node->canSetTag);
> + slot = ExecUpdate(node, resultRelInfo, tupleid, oldtuple, slot,
> + planSlot, &node->mt_epqstate, estate,
> + node->canSetTag);
> break;
> case CMD_DELETE:
> - slot = ExecDelete(node, tupleid, oldtuple, planSlot,
> - &node->mt_epqstate, estate,
> + slot = ExecDelete(node, resultRelInfo, tupleid, oldtuple,
> + planSlot, &node->mt_epqstate, estate,
> true, node->canSetTag,
> false /* changingPart */ , NULL, NULL);
> break;

This reminds me of another complaint: ExecDelete and ExecInsert() have
gotten more boolean parameters for partition moving, but only one of
them is explained with a comment (/* changingPart */) - think we should
do that for all.

> diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
> index 43edfef089..7df3e78b22 100644
> --- a/src/backend/replication/logical/worker.c
> +++ b/src/backend/replication/logical/worker.c
> @@ -173,10 +173,10 @@ ensure_transaction(void)
> * This is based on similar code in copy.c
> */
> static EState *
> -create_estate_for_relation(LogicalRepRelMapEntry *rel)
> +create_estate_for_relation(LogicalRepRelMapEntry *rel,
> + ResultRelInfo **resultRelInfo)
> {
> EState *estate;
> - ResultRelInfo *resultRelInfo;
> RangeTblEntry *rte;
>
> estate = CreateExecutorState();
> @@ -188,12 +188,11 @@ create_estate_for_relation(LogicalRepRelMapEntry *rel)
> rte->rellockmode = AccessShareLock;
> ExecInitRangeTable(estate, list_make1(rte));
>
> - resultRelInfo = makeNode(ResultRelInfo);
> - InitResultRelInfo(resultRelInfo, rel->localrel, 1, NULL, 0);
> + *resultRelInfo = makeNode(ResultRelInfo);
> + InitResultRelInfo(*resultRelInfo, rel->localrel, 1, NULL, 0);
>
> - estate->es_result_relations = resultRelInfo;
> + estate->es_result_relations = *resultRelInfo;
> estate->es_num_result_relations = 1;
> - estate->es_result_relation_info = resultRelInfo;
>
> estate->es_output_cid = GetCurrentCommandId(true);
>
> @@ -567,6 +566,7 @@ GetRelationIdentityOrPK(Relation rel)
> static void
> apply_handle_insert(StringInfo s)
> {
> + ResultRelInfo *resultRelInfo;
> LogicalRepRelMapEntry *rel;
> LogicalRepTupleData newtup;
> LogicalRepRelId relid;
> @@ -589,7 +589,7 @@ apply_handle_insert(StringInfo s)
> }
>
> /* Initialize the executor state. */
> - estate = create_estate_for_relation(rel);
> + estate = create_estate_for_relation(rel, &resultRelInfo);

Hm. It kinda seems cleaner if we were to instead return the relevant
index, rather than the entire ResultRelInfo, as an output from
create_estate_for_relation(). Makes it clearer that it's still in the
EState.

Or perhaps we ought to compute it in a separate step? Then that'd be
more amenable to support replcating into partition roots.

> /*
> - * If this insert is the result of a partition key update that moved the
> - * tuple to a new partition, put this row into the transition NEW TABLE,
> - * if there is one. We need to do this separately for DELETE and INSERT
> - * because they happen on different tables.
> + * If this delete is a part of a partition key update, put this row into
> + * the UPDATE trigger's NEW TABLE instead of that of an INSERT trigger.
> */
> - ar_insert_trig_tcs = mtstate->mt_transition_capture;
> - if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
> - && mtstate->mt_transition_capture->tcs_update_new_table)
> + if (mtstate->operation == CMD_UPDATE &&
> + mtstate->mt_transition_capture &&
> + mtstate->mt_transition_capture->tcs_update_new_table)
> {
> - ExecARUpdateTriggers(estate, resultRelInfo, NULL,
> - NULL,
> - slot,
> - NULL,
> - mtstate->mt_transition_capture);
> + ExecARUpdateTriggers(estate, resultRelInfo, NULL, NULL, slot,
> + NIL, mtstate->mt_transition_capture);
>
> /*
> - * We've already captured the NEW TABLE row, so make sure any AR
> - * INSERT trigger fired below doesn't capture it again.
> + * Execute AFTER ROW INSERT Triggers, but such that the row is not
> + * captured again in the transition table if any.
> */
> - ar_insert_trig_tcs = NULL;
> + ExecARInsertTriggers(estate, resultRelInfo, slot, recheckIndexes,
> + NULL);
> + }
> + else
> + {
> + /* AFTER ROW INSERT Triggers */
> + ExecARInsertTriggers(estate, resultRelInfo, slot, recheckIndexes,
> + mtstate->mt_transition_capture);
> }
> -
> - /* AFTER ROW INSERT Triggers */
> - ExecARInsertTriggers(estate, resultRelInfo, slot, recheckIndexes,
> - ar_insert_trig_tcs);

While a tiny bit more code, perhaps, this is considerably clearer
imo. Thanks.

> +/*
> + * ExecMove
> + * Move an updated tuple from the input result relation to the
> + * new partition of its root parent table
> + *
> + * This works by first deleting the tuple from the input result relation
> + * followed by inserting it into the root parent table, that is,
> + * mtstate->rootResultRelInfo.
> + *
> + * Returns true if it's detected that the tuple we're trying to move has
> + * been concurrently updated.
> + */
> +static bool
> +ExecMove(ModifyTableState *mtstate, ResultRelInfo *resultRelInfo,
> + ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *planSlot,
> + EPQState *epqstate, bool canSetTag, TupleTableSlot **slot,
> + TupleTableSlot **inserted_tuple)
> +{

I know that it was one of the names I proposed, but now that I'm
thinking about it again, it sounds too generic. Perhaps
ExecCrossPartitionUpdate() wouldn't be a quite so generic name? Since
there's only one reference the longer name wouldn't be painful.

> + /*
> + * Row movement, part 1. Delete the tuple, but skip RETURNING
> + * processing. We want to return rows from INSERT.
> + */
> + ExecDelete(mtstate, resultRelInfo, tupleid, oldtuple, planSlot,
> + epqstate, estate, false, false /* canSetTag */ ,
> + true /* changingPart */ , &tuple_deleted, &epqslot);

Here again it'd be nice if all the booleans would be explained with a
comment.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jesper Pedersen 2019-07-19 17:01:27 Re: pg_receivewal documentation
Previous Message Tomas Vondra 2019-07-19 16:46:35 Re: [sqlsmith] Crash in mcv_get_match_bitmap