Re: partition routing layering in nodeModifyTable.c

From: Amit Langote <amitlangote09(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, amitdkhan(dot)pg(at)gmail(dot)com
Subject: Re: partition routing layering in nodeModifyTable.c
Date: 2019-07-30 07:20:53
Message-ID: CA+HiwqFDGhGT8AXUQ4S5ytzuLNm18Qbu1js7XG3eYk5h-zhxmw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Andres,

Sorry about the delay in replying as I was on vacation for the last few days.

On Sat, Jul 20, 2019 at 1:52 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > The first one (0001) deals with reducing the core executor's reliance
> > on es_result_relation_info to access the currently active result
> > relation, in favor of receiving it from the caller as a function
> > argument. So no piece of core code relies on it being correctly set
> > anymore. It still needs to be set correctly for the third-party code
> > such as FDWs.
>
> I'm inclined to just remove it. There's not much code out there relying
> on it, as far as I can tell. Most FDWs don't support the direct modify
> API, and that's afaict the case where we one needs to use
> es_result_relation_info?

Right, only the directly modify API uses it.

> In fact, I searched through alllFDWs listed on https://wiki.postgresql.org/wiki/Foreign_data_wrappers
> that are on github and in first few categories (up and including to
> "file wrappers"), and there was only one reference to
> es_result_relation_info, and that just in comments in a test:
> https://github.com/pgspider/griddb_fdw/search?utf8=%E2%9C%93&q=es_result_relation_info&type=
> which I think was just copied from our source code.
>
> IOW, we should just change the direct modify calls to get the relevant
> ResultRelationInfo or something in that vein (perhaps just the relevant
> RT index?).

It seems easy to make one of the two functions that constitute the
direct modify API, IterateDirectModify(), access the result relation
from ForeignScanState by saving either the result relation RT index or
ResultRelInfo pointer itself into the ForeignScanState's FDW-private
area. For example, for postgres_fdw, one would simply add a new
member to PgFdwDirectModifyState struct.

Doing that for the other function BeginDirectModify() seems a bit more
involved. We could add a new field to ForeignScan, say
resultRelation, that's set by either PlanDirectModify() (the FDW code)
or make_modifytable() (the core code) if the ForeignScan node contains
the command for direct modification. BeginDirectModify() can then use
that value instead of relying on es_result_relation_info being set.

Thoughts? Fujita-san, do you have any opinion on whether that would
be a good idea?

> pglogical also references it, but just because it creates its own
> EState afaict.

That sounds easily manageable.

> > @@ -334,32 +335,50 @@ ExecComputeStoredGenerated(EState *estate, TupleTableSlot *slot)
> > * ExecInsert
> > *
> > * For INSERT, we have to insert the tuple into the target relation
> > - * and insert appropriate tuples into the index relations.
> > + * (or partition thereof) and insert appropriate tuples into the index
> > + * relations.
> > *
> > * Returns RETURNING result if any, otherwise NULL.
> > + *
> > + * This may change the currently active tuple conversion map in
> > + * mtstate->mt_transition_capture, so the callers must take care to
> > + * save the previous value to avoid losing track of it.
> > * ----------------------------------------------------------------
> > */
> > static TupleTableSlot *
> > ExecInsert(ModifyTableState *mtstate,
> > + ResultRelInfo *resultRelInfo,
> > TupleTableSlot *slot,
> > TupleTableSlot *planSlot,
> > EState *estate,
> > bool canSetTag)
> > {
> > - ResultRelInfo *resultRelInfo;
> > Relation resultRelationDesc;
> > List *recheckIndexes = NIL;
> > TupleTableSlot *result = NULL;
> > TransitionCaptureState *ar_insert_trig_tcs;
> > ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
> > OnConflictAction onconflict = node->onConflictAction;
> > + PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
> > +
> > + /*
> > + * If the input result relation is a partitioned table, find the leaf
> > + * partition to insert the tuple into.
> > + */
> > + if (proute)
> > + {
> > + ResultRelInfo *partRelInfo;
> > +
> > + slot = ExecPrepareTupleRouting(mtstate, estate, proute,
> > + resultRelInfo, slot,
> > + &partRelInfo);
> > + resultRelInfo = partRelInfo;
> > + /* Result relation has changed, so update EState reference too. */
> > + estate->es_result_relation_info = resultRelInfo;
> > + }
>
> I think by removing es_result_relation entirely, this would look
> cleaner.

I agree. Maybe, setting es_result_relation_info here isn't really
needed, because the ResultRelInfo is directly passed through
ExecForeignInsert. Still, some FDWs may be relying on
es_result_relation_info being correctly set despite the
aforementioned. Again, the only way to get them to stop doing so may
be to remove it.

> > @@ -1271,18 +1274,18 @@ lreplace:;
> > mtstate->mt_root_tuple_slot);
> >
> > /*
> > - * Prepare for tuple routing, making it look like we're inserting
> > - * into the root.
> > + * ExecInsert() may scribble on mtstate->mt_transition_capture,
> > + * so save the currently active map.
> > */
> > + if (mtstate->mt_transition_capture)
> > + saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
>
> Wonder if we could remove the need for this somehow, it's still pretty
> darn ugly. Thomas, perhaps you have some insights?
>
> To me the need to modify these ModifyTable wide state on a per-subplan
> and even per-partition basis indicates that the datastructures are in
> the wrong place.

I agree that having to ensure tcs_map is set correctly is cumbersome,
because it has to be reset every time the currently active result
relation changes. I think a better place for the map to be is
ResultRelInfo itself. The trigger code can just get the correct map
from the ResultRelInfo of the result relation it's processing.

Regarding that idea, the necessary map is already present in the
tuple-routing state struct that's embedded in the partition's
ResultRelInfo. But the UPDATE result relations that are never
processed as tuple routing targets don't have routing info initialized
(also think non-partition inheritance children), so we could add
another TupleConversionMap * field in ResultRelInfo. Attached patch
0003 implements that.

With this change, we no longer need to track the map in a global
variable, that is, TransitionCaptureState no longer needs tcs_map. We
still have tcs_original_insert_tuple though, which must be set during
ExecInsert and reset after it's read by AfterTriggerSaveEvent. I have
moved the resetting of its value to right after where the originally
set value is read to make it clear that the value must be read only
once.

> > @@ -2212,23 +2207,17 @@ ExecModifyTable(PlanState *pstate)
> > switch (operation)
> > {
> > case CMD_INSERT:
> > - /* Prepare for tuple routing if needed. */
> > - if (proute)
> > - slot = ExecPrepareTupleRouting(node, estate, proute,
> > - resultRelInfo, slot);
> > - slot = ExecInsert(node, slot, planSlot,
> > + slot = ExecInsert(node, resultRelInfo, slot, planSlot,
> > estate, node->canSetTag);
> > - /* Revert ExecPrepareTupleRouting's state change. */
> > - if (proute)
> > - estate->es_result_relation_info = resultRelInfo;
> > break;
> > case CMD_UPDATE:
> > - slot = ExecUpdate(node, tupleid, oldtuple, slot, planSlot,
> > - &node->mt_epqstate, estate, node->canSetTag);
> > + slot = ExecUpdate(node, resultRelInfo, tupleid, oldtuple, slot,
> > + planSlot, &node->mt_epqstate, estate,
> > + node->canSetTag);
> > break;
> > case CMD_DELETE:
> > - slot = ExecDelete(node, tupleid, oldtuple, planSlot,
> > - &node->mt_epqstate, estate,
> > + slot = ExecDelete(node, resultRelInfo, tupleid, oldtuple,
> > + planSlot, &node->mt_epqstate, estate,
> > true, node->canSetTag,
> > false /* changingPart */ , NULL, NULL);
> > break;
>
> This reminds me of another complaint: ExecDelete and ExecInsert() have
> gotten more boolean parameters for partition moving, but only one of
> them is explained with a comment (/* changingPart */) - think we should
> do that for all.

Agree about the confusing state of ExecDelete call sites. I've
reformatted the calls to properly label the arguments (the changes are
contained in the revised 0001). I don't see many
partitioning-specific boolean parameters in ExecInsert though.

> > diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
> > index 43edfef089..7df3e78b22 100644
> > --- a/src/backend/replication/logical/worker.c
> > +++ b/src/backend/replication/logical/worker.c
> > @@ -173,10 +173,10 @@ ensure_transaction(void)
> > * This is based on similar code in copy.c
> > */
> > static EState *
> > -create_estate_for_relation(LogicalRepRelMapEntry *rel)
> > +create_estate_for_relation(LogicalRepRelMapEntry *rel,
> > + ResultRelInfo **resultRelInfo)
> > {
> > EState *estate;
> > - ResultRelInfo *resultRelInfo;
> > RangeTblEntry *rte;
> >
> > estate = CreateExecutorState();
> > @@ -188,12 +188,11 @@ create_estate_for_relation(LogicalRepRelMapEntry *rel)
> > rte->rellockmode = AccessShareLock;
> > ExecInitRangeTable(estate, list_make1(rte));
> >
> > - resultRelInfo = makeNode(ResultRelInfo);
> > - InitResultRelInfo(resultRelInfo, rel->localrel, 1, NULL, 0);
> > + *resultRelInfo = makeNode(ResultRelInfo);
> > + InitResultRelInfo(*resultRelInfo, rel->localrel, 1, NULL, 0);
> >
> > - estate->es_result_relations = resultRelInfo;
> > + estate->es_result_relations = *resultRelInfo;
> > estate->es_num_result_relations = 1;
> > - estate->es_result_relation_info = resultRelInfo;
> >
> > estate->es_output_cid = GetCurrentCommandId(true);
> >
> > @@ -567,6 +566,7 @@ GetRelationIdentityOrPK(Relation rel)
> > static void
> > apply_handle_insert(StringInfo s)
> > {
> > + ResultRelInfo *resultRelInfo;
> > LogicalRepRelMapEntry *rel;
> > LogicalRepTupleData newtup;
> > LogicalRepRelId relid;
> > @@ -589,7 +589,7 @@ apply_handle_insert(StringInfo s)
> > }
> >
> > /* Initialize the executor state. */
> > - estate = create_estate_for_relation(rel);
> > + estate = create_estate_for_relation(rel, &resultRelInfo);
>
> Hm. It kinda seems cleaner if we were to instead return the relevant
> index, rather than the entire ResultRelInfo, as an output from
> create_estate_for_relation(). Makes it clearer that it's still in the
> EState.

For now, I've reverted these changes in favor of just doing this:

/* Initialize the executor state. */
estate = create_estate_for_relation(rel);
+ resultRelInfo = &estate->es_result_relations[0];

This seems OK as we know for sure that there is only one target relation.

> Or perhaps we ought to compute it in a separate step? Then that'd be
> more amenable to support replcating into partition roots.

If we think of create_estate_for_relation() being like InitPlan(),
then perhaps it makes sense to leave it as is. Any setup needed for
replicating into partition roots will have to be in a separate
function anyway.

> > +/*
> > + * ExecMove
> > + * Move an updated tuple from the input result relation to the
> > + * new partition of its root parent table
> > + *
> > + * This works by first deleting the tuple from the input result relation
> > + * followed by inserting it into the root parent table, that is,
> > + * mtstate->rootResultRelInfo.
> > + *
> > + * Returns true if it's detected that the tuple we're trying to move has
> > + * been concurrently updated.
> > + */
> > +static bool
> > +ExecMove(ModifyTableState *mtstate, ResultRelInfo *resultRelInfo,
> > + ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *planSlot,
> > + EPQState *epqstate, bool canSetTag, TupleTableSlot **slot,
> > + TupleTableSlot **inserted_tuple)
> > +{
>
> I know that it was one of the names I proposed, but now that I'm
> thinking about it again, it sounds too generic. Perhaps
> ExecCrossPartitionUpdate() wouldn't be a quite so generic name? Since
> there's only one reference the longer name wouldn't be painful.

OK, I've renamed ExecMove to ExecCrossPartitionUpdate.

> > + /*
> > + * Row movement, part 1. Delete the tuple, but skip RETURNING
> > + * processing. We want to return rows from INSERT.
> > + */
> > + ExecDelete(mtstate, resultRelInfo, tupleid, oldtuple, planSlot,
> > + epqstate, estate, false, false /* canSetTag */ ,
> > + true /* changingPart */ , &tuple_deleted, &epqslot);
>
> Here again it'd be nice if all the booleans would be explained with a
> comment.

Done too.

Attached updated 0001, 0002, and the new 0003 for transition tuple
conversion map related refactoring as explained above.

Thanks,
Amit

Attachment Content-Type Size
v2-0003-Refactor-transition-tuple-capture-code-a-bit.patch application/octet-stream 19.4 KB
v2-0002-Rearrange-partition-update-row-movement-code-a-bi.patch application/octet-stream 16.0 KB
v2-0001-Reduce-es_result_relation_info-usage.patch application/octet-stream 35.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2019-07-30 07:34:29 Re: CVE-2017-7484-induced bugs, or, btree cmp functions are not leakproof?
Previous Message Thomas Munro 2019-07-30 06:50:29 Re: POC: Cleaning up orphaned files using undo logs