Re: [HACKERS] MERGE SQL Statement for PG11

From: Andres Freund <andres(at)anarazel(dot)de>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Pavan Deolasee <pavan(dot)deolasee(at)2ndquadrant(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] MERGE SQL Statement for PG11
Date: 2018-04-03 02:18:00
Message-ID: 20180403021800.b5nsgiclzanobiup@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2018-03-30 12:10:04 +0100, Simon Riggs wrote:
> On 29 March 2018 at 10:50, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> > On 28 March 2018 at 12:00, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com> wrote:
> >
> >> v27 attached, though review changes are in
> >> the add-on 0005 patch.
> >
> > This all looks good now, thanks for making all of those changes.
> >
> > I propose [v27 patch1+patch3+patch5] as the initial commit candidate
> > for MERGE, with other patches following later before end CF.
> >
> > I propose to commit this tomorrow, 30 March, about 26 hours from now.
> > That will allow some time for buildfarm fixing/reversion before the
> > Easter weekend, then other patches to follow starting 2 April. That
> > then gives reasonable time to follow up on other issues that we will
> > no doubt discover fairly soon after commit, such as additional runs by
> > SQLsmith and more eyeballs.
>
> No problems found, but moving proposed commit to 2 April pm

I did a scan through this, as I hadn't been able to keep with the thread
previously. Sorry if some of the things mentioned here have been
discussed previously. I am just reading through the patch in its own
order, so please excuse if there's things I remark on that only later
fully make sense.

later update: TL;DR: I don't think the parser / executor implementation
of MERGE is architecturally sound. I think creating hidden joins during
parse-analysis to implement MERGE is a seriously bad idea and it needs
to be replaced by a different executor structure.

@@ -3828,8 +3846,14 @@ struct AfterTriggersTableData
bool before_trig_done; /* did we already queue BS triggers? */
bool after_trig_done; /* did we already queue AS triggers? */
AfterTriggerEventList after_trig_events; /* if so, saved list pointer */
- Tuplestorestate *old_tuplestore; /* "old" transition table, if any */
- Tuplestorestate *new_tuplestore; /* "new" transition table, if any */
+ /* "old" transition table for UPDATE, if any */
+ Tuplestorestate *old_upd_tuplestore;
+ /* "new" transition table for UPDATE, if any */
+ Tuplestorestate *new_upd_tuplestore;
+ /* "old" transition table for DELETE, if any */
+ Tuplestorestate *old_del_tuplestore;
+ /* "new" transition table INSERT, if any */
+ Tuplestorestate *new_ins_tuplestore;
};

A comment somewhere why we have all of these (presumably because they
can now happen in the context of a single statement) would be good.

@@ -5744,12 +5796,28 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
newtup == NULL));

if (oldtup != NULL &&
- ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
- (event == TRIGGER_EVENT_UPDATE && update_old_table)))
+ (event == TRIGGER_EVENT_DELETE && delete_old_table))
{
Tuplestorestate *old_tuplestore;

- old_tuplestore = transition_capture->tcs_private->old_tuplestore;
+ old_tuplestore = transition_capture->tcs_private->old_del_tuplestore;
+
+ if (map != NULL)
+ {
+ HeapTuple converted = do_convert_tuple(oldtup, map);
+
+ tuplestore_puttuple(old_tuplestore, converted);
+ pfree(converted);
+ }
+ else
+ tuplestore_puttuple(old_tuplestore, oldtup);
+ }

Very similar code is now repeated four times. Could you abstract this
into a separate function?

--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -37,6 +37,16 @@ the plan tree returns the computed tuples to be updated, plus a "junk"
one. For DELETE, the plan tree need only deliver a CTID column, and the
ModifyTable node visits each of those rows and marks the row deleted.

+MERGE runs one generic plan that returns candidate target rows. Each row
+consists of a super-row that contains all the columns needed by any of the
+individual actions, plus a CTID and a TABLEOID junk columns. The CTID column is

"plus a CTID and a TABLEOID junk columns" sounds a bit awkward?

+required to know if a matching target row was found or not and the TABLEOID
+column is needed to find the underlying target partition, in case when the
+target table is a partition table. If the CTID column is set we attempt to
+activate WHEN MATCHED actions, or if it is NULL then we will attempt to

"if it is NULL then" sounds wrong.

+activate WHEN NOT MATCHED actions. Once we know which action is activated we
+form the final result row and apply only those changes.
+
XXX a great deal more documentation needs to be written here...

Are we using tableoids in a similar way in other places?

+/*
+ * Given OID of the partition leaf, return the index of the leaf in the
+ * partition hierarchy.
+ */
+int
+ExecFindPartitionByOid(PartitionTupleRouting *proute, Oid partoid)
+{
+ int i;
+
+ for (i = 0; i < proute->num_partitions; i++)
+ {
+ if (proute->partition_oids[i] == partoid)
+ break;
+ }
+
+ Assert(i < proute->num_partitions);
+ return i;
+}

Shouldn't we at least warn in a comment that this is O(N)? And document
that it does weird stuff if the OID isn't found?

Perhaps just introduce a PARTOID syscache?

diff --git a/src/backend/executor/nodeMerge.c b/src/backend/executor/nodeMerge.c
new file mode 100644
index 00000000000..0e0d0795d4d
--- /dev/null
+++ b/src/backend/executor/nodeMerge.c
@@ -0,0 +1,575 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeMerge.c
+ * routines to handle Merge nodes relating to the MERGE command

Isn't this file misnamed and it should be execMerge.c? The rest of the
node*.c files are for nodes that are invoked via execProcnode /
ExecProcNode(). This isn't an actual executor node.

Might also be worthwhile to move the merge related init stuff here?

What's the reasoning behind making this be an anomaluous type of
executor node?

+/*
+ * Perform MERGE.
+ */
+void
+ExecMerge(ModifyTableState *mtstate, EState *estate, TupleTableSlot *slot,
+ JunkFilter *junkfilter, ResultRelInfo *resultRelInfo)
+{

+ * ExecMergeMatched takes care of following the update chain and
+ * re-finding the qualifying WHEN MATCHED action, as long as the updated
+ * target tuple still satisfies the join quals i.e. it still remains a
+ * WHEN MATCHED case. If the tuple gets deleted or the join quals fail, it
+ * returns and we try ExecMergeNotMatched. Given that ExecMergeMatched
+ * always make progress by following the update chain and we never switch
+ * from ExecMergeNotMatched to ExecMergeMatched, there is no risk of a
+ * livelock.
+ */
+ if (matched)
+ matched = ExecMergeMatched(mtstate, estate, slot, junkfilter, tupleid);
+
+ /*
+ * Either we were dealing with a NOT MATCHED tuple or ExecMergeNotMatched()
+ * returned "false", indicating the previously MATCHED tuple is no longer a
+ * matching tuple.
+ */
+ if (!matched)
+ ExecMergeNotMatched(mtstate, estate, slot);

So what happens if there's a concurrent insertion of a potentially
matching tuple?

FWIW, I'd re-order this file so this routine is above
ExecMergeMatched(), ExecMergeNotMatched(), easier to understand.

+static bool
+ExecMergeMatched(ModifyTableState *mtstate, EState *estate,
+ TupleTableSlot *slot, JunkFilter *junkfilter,
+ ItemPointer tupleid)
+{
+ ExprContext *econtext = mtstate->ps.ps_ExprContext;
+ bool isNull;
+ List *mergeMatchedActionStates = NIL;
+ HeapUpdateFailureData hufd;
+ bool tuple_updated,
+ tuple_deleted;
+ Buffer buffer;
+ HeapTupleData tuple;
+ EPQState *epqstate = &mtstate->mt_epqstate;
+ ResultRelInfo *saved_resultRelInfo;
+ ResultRelInfo *resultRelInfo = estate->es_result_relation_info;
+ ListCell *l;
+ TupleTableSlot *saved_slot = slot;
+
+ if (mtstate->mt_partition_tuple_routing)
+ {
+ Datum datum;
+ Oid tableoid = InvalidOid;
+ int leaf_part_index;
+ PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
+
+ /*
+ * In case of partitioned table, we fetch the tableoid while performing
+ * MATCHED MERGE action.
+ */
+ datum = ExecGetJunkAttribute(slot, junkfilter->jf_otherJunkAttNo,
+ &isNull);
+ Assert(!isNull);
+ tableoid = DatumGetObjectId(datum);
+
+ /*
+ * If we're dealing with a MATCHED tuple, then tableoid must have been
+ * set correctly. In case of partitioned table, we must now fetch the
+ * correct result relation corresponding to the child table emitting
+ * the matching target row. For normal table, there is just one result
+ * relation and it must be the one emitting the matching row.
+ */
+ leaf_part_index = ExecFindPartitionByOid(proute, tableoid);
+
+ resultRelInfo = proute->partitions[leaf_part_index];
+ if (resultRelInfo == NULL)
+ {
+ resultRelInfo = ExecInitPartitionInfo(mtstate,
+ mtstate->resultRelInfo,
+ proute, estate, leaf_part_index);
+ Assert(resultRelInfo != NULL);
+ }
+ }
+
+ /*
+ * Save the current information and work with the correct result relation.
+ */
+ saved_resultRelInfo = resultRelInfo;
+ estate->es_result_relation_info = resultRelInfo;
+
+ /*
+ * And get the correct action lists.
+ */
+ mergeMatchedActionStates =
+ resultRelInfo->ri_mergeState->matchedActionStates;
+
+ /*
+ * If there are not WHEN MATCHED actions, we are done.
+ */
+ if (mergeMatchedActionStates == NIL)
+ return true;

Maybe I'm confused, but why is mergeMatchedActionStates attached to
per-partition info? How can this differ in number between partitions,
requiring us to re-check it below fetching the partition info above?

+ /*
+ * Check for any concurrent update/delete operation which may have
+ * prevented our update/delete. We also check for situations where we
+ * might be trying to update/delete the same tuple twice.
+ */
+ if ((action->commandType == CMD_UPDATE && !tuple_updated) ||
+ (action->commandType == CMD_DELETE && !tuple_deleted))
+
+ {
+ switch (hufd.result)
+ {
+ case HeapTupleMayBeUpdated:
+ break;
+ case HeapTupleInvisible:
+
+ /*
+ * This state should never be reached since the underlying
+ * JOIN runs with a MVCC snapshot and should only return
+ * rows visible to us.
+ */

Given EPQ, that reasoning isn't correct. I think this should still be
unreachable, just not for the reason described here.

+ case HeapTupleUpdated:
+
+ /*
+ * The target tuple was concurrently updated/deleted by
+ * some other transaction.
+ *
+ * If the current tuple is that last tuple in the update
+ * chain, then we know that the tuple was concurrently
+ * deleted. Just return and let the caller try NOT MATCHED
+ * actions.
+ *
+ * If the current tuple was concurrently updated, then we
+ * must run the EvalPlanQual() with the new version of the
+ * tuple. If EvalPlanQual() does not return a tuple then
+ * we switch to the NOT MATCHED list of actions.
+ * If it does return a tuple and the join qual is
+ * still satisfied, then we just need to recheck the
+ * MATCHED actions, starting from the top, and execute the
+ * first qualifying action.
+ */
+ if (!ItemPointerEquals(tupleid, &hufd.ctid))
+ {
+ TupleTableSlot *epqslot;
+
+ /*
+ * Since we generate a JOIN query with a target table
+ * RTE different than the result relation RTE, we must
+ * pass in the RTI of the relation used in the join
+ * query and not the one from result relation.
+ */
+ Assert(resultRelInfo->ri_mergeTargetRTI > 0);
+ epqslot = EvalPlanQual(estate,
+ epqstate,
+ resultRelInfo->ri_RelationDesc,
+ GetEPQRangeTableIndex(resultRelInfo),
+ LockTupleExclusive,
+ &hufd.ctid,
+ hufd.xmax);
+
+ if (!TupIsNull(epqslot))
+ {
+ (void) ExecGetJunkAttribute(epqslot,
+ resultRelInfo->ri_junkFilter->jf_junkAttNo,
+ &isNull);
+
+ /*
+ * A non-NULL ctid means that we are still dealing
+ * with MATCHED case. But we must retry from the
+ * start with the updated tuple to ensure that the
+ * first qualifying WHEN MATCHED action is
+ * executed.
+ *
+ * We don't use the new slot returned by
+ * EvalPlanQual because we anyways re-install the
+ * new target tuple in econtext->ecxt_scantuple
+ * before re-evaluating WHEN AND conditions and
+ * re-projecting the update targetlists. The
+ * source side tuple does not change and hence we
+ * can safely continue to use the old slot.
+ */
+ if (!isNull)
+ {
+ /*
+ * Must update *tupleid to the TID of the
+ * newer tuple found in the update chain.
+ */
+ *tupleid = hufd.ctid;
+ ReleaseBuffer(buffer);
+ goto lmerge_matched;

It seems fairly bad architecturally to me that we now have
EvalPlanQual() loops in both this routine *and*
ExecUpdate()/ExecDelete().

+
+/*
+ * Execute the first qualifying NOT MATCHED action.
+ */
+static void
+ExecMergeNotMatched(ModifyTableState *mtstate, EState *estate,
+ TupleTableSlot *slot)
+{
+ PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
+ ExprContext *econtext = mtstate->ps.ps_ExprContext;
+ List *mergeNotMatchedActionStates = NIL;
+ ResultRelInfo *resultRelInfo;
+ ListCell *l;
+ TupleTableSlot *myslot;
+
+ /*
+ * We are dealing with NOT MATCHED tuple. Since for MERGE, partition tree

*the partition tree

+ * is not expanded for the result relation, we continue to work with the
+ * currently active result relation, which should be of the root of the
+ * partition tree.
+ */
+ resultRelInfo = mtstate->resultRelInfo;

"should be"? "is", I hope? Given that it's referencing
mtstate->resultRelInfo which is only set in one place...

+/* flags for mt_merge_subcommands */
+#define MERGE_INSERT 0x01
+#define MERGE_UPDATE 0x02
+#define MERGE_DELETE 0x04

Hm, it seems a bit weird to define these file-local, given there's
another file implementing a good chunk of merge.

* ExecWithCheckOptions() will skip any WCOs which are not of the kind
@@ -617,10 +627,19 @@ ExecInsert(ModifyTableState *mtstate,
* passed to foreign table triggers; it is NULL when the foreign
* table has no relevant triggers.
*
+ * MERGE passes actionState of the action it's currently executing;
+ * regular DELETE passes NULL. This is used by ExecDelete to know if it's
+ * being called from MERGE or regular DELETE operation.
+ *
+ * If the DELETE fails because the tuple is concurrently updated/deleted
+ * by this or some other transaction, hufdp is filled with the reason as
+ * well as other important information. Currently only MERGE needs this
+ * information.
+ *
* Returns RETURNING result if any, otherwise NULL.
* ----------------------------------------------------------------
*/
-static TupleTableSlot *
+TupleTableSlot *
ExecDelete(ModifyTableState *mtstate,
ItemPointer tupleid,
HeapTuple oldtuple,
@@ -629,6 +648,8 @@ ExecDelete(ModifyTableState *mtstate,
EState *estate,
bool *tupleDeleted,
bool processReturning,
+ HeapUpdateFailureData *hufdp,
+ MergeActionState *actionState,
bool canSetTag)
{
ResultRelInfo *resultRelInfo;
@@ -641,6 +662,14 @@ ExecDelete(ModifyTableState *mtstate,
if (tupleDeleted)
*tupleDeleted = false;

+ /*
+ * Initialize hufdp. Since the caller is only interested in the failure
+ * status, initialize with the state that is used to indicate successful
+ * operation.
+ */
+ if (hufdp)
+ hufdp->result = HeapTupleMayBeUpdated;
+

This signals for me that the addition of the result field to HUFD wasn't
architecturally the right thing. HUFD is supposed to be supposed to be
returned by heap_update(), reusing and setting it from other places
seems like a layering violation to me.

diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 69dd327f0c9..cd540a0df5b 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -851,6 +851,60 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
fix_scan_list(root, splan->exclRelTlist, rtoffset);
}

+ /*
+ * The MERGE produces the target rows by performing a right

s/the MERGE/the MERGE statement/?

+ * join between the target relation and the source relation
+ * (which could be a plain relation or a subquery). The INSERT
+ * and UPDATE actions of the MERGE requires access to the

same.

+opt_and_condition:

Renaming this to be a bit more merge specific seems like a good idea.

+/*-------------------------------------------------------------------------
+ *
+ * parse_merge.c
+ * handle merge-statement in parser

+/*
+ * Special handling for MERGE statement is required because we assemble
+ * the query manually. This is similar to setTargetTable() followed
+ * by transformFromClause() but with a few less steps.
+ *
+ * Process the FROM clause and add items to the query's range table,
+ * joinlist, and namespace.
+ *
+ * A special targetlist comprising of the columns from the right-subtree of
+ * the join is populated and returned. Note that when the JoinExpr is
+ * setup by transformMergeStmt, the left subtree has the target result
+ * relation and the right subtree has the source relation.
+ *
+ * Returns the rangetable index of the target relation.
+ */
+static int
+transformMergeJoinClause(ParseState *pstate, Node *merge,
+ List **mergeSourceTargetList)
+{
+ RangeTblEntry *rte,
+ *rt_rte;
+ List *namespace;
+ int rtindex,
+ rt_rtindex;
+ Node *n;
+ int mergeTarget_relation = list_length(pstate->p_rtable) + 1;
+ Var *var;
+ TargetEntry *te;
+
+ n = transformFromClauseItem(pstate, merge,
+ &rte,
+ &rtindex,
+ &rt_rte,
+ &rt_rtindex,
+ &namespace);
+
+ pstate->p_joinlist = list_make1(n);
+
+ /*
+ * We created an internal join between the target and the source relation
+ * to carry out the MERGE actions. Normally such an unaliased join hides
+ * the joining relations, unless the column references are qualified.
+ * Also, any unqualified column references are resolved to the Join RTE, if
+ * there is a matching entry in the targetlist. But the way MERGE
+ * execution is later setup, we expect all column references to resolve to
+ * either the source or the target relation. Hence we must not add the
+ * Join RTE to the namespace.
+ *
+ * The last entry must be for the top-level Join RTE. We don't want to
+ * resolve any references to the Join RTE. So discard that.
+ *
+ * We also do not want to resolve any references from the leftside of the
+ * Join since that corresponds to the target relation. References to the
+ * columns of the target relation must be resolved from the result
+ * relation and not the one that is used in the join. So the
+ * mergeTarget_relation is marked invisible to both qualified as well as
+ * unqualified references.
+ */

*Gulp*. I'm extremely doubtful this is a sane approach. At the very
least the amount of hackery required to make existing code cope with
the approach / duplicate code seems indicative of that.

+ Assert(list_length(namespace) > 1);
+ namespace = list_truncate(namespace, list_length(namespace) - 1);
+ pstate->p_namespace = list_concat(pstate->p_namespace, namespace);
+
+ setNamespaceVisibilityForRTE(pstate->p_namespace,
+ rt_fetch(mergeTarget_relation, pstate->p_rtable), false, false);
+
+ /*
+ * Expand the right relation and add its columns to the
+ * mergeSourceTargetList. Note that the right relation can either be a
+ * plain relation or a subquery or anything that can have a
+ * RangeTableEntry.
+ */
+ *mergeSourceTargetList = expandSourceTL(pstate, rt_rte, rt_rtindex);
+
+ /*
+ * Add a whole-row-Var entry to support references to "source.*".
+ */
+ var = makeWholeRowVar(rt_rte, rt_rtindex, 0, false);
+ te = makeTargetEntry((Expr *) var, list_length(*mergeSourceTargetList) + 1,
+ NULL, true);

Why can't this properly dealt with by transformWholeRowRef() etc?

+/*
+ * transformMergeStmt -
+ * transforms a MERGE statement
+ */
+Query *
+transformMergeStmt(ParseState *pstate, MergeStmt *stmt)
+{
+ Query *qry = makeNode(Query);
+ ListCell *l;
+ AclMode targetPerms = ACL_NO_RIGHTS;
+ bool is_terminal[2];
+ JoinExpr *joinexpr;
+ RangeTblEntry *resultRelRTE, *mergeRelRTE;
+
+ /* There can't be any outer WITH to worry about */
+ Assert(pstate->p_ctenamespace == NIL);
+
+ qry->commandType = CMD_MERGE;
+
+ /*
+ * Check WHEN clauses for permissions and sanity
+ */
+ is_terminal[0] = false;
+ is_terminal[1] = false;
+ foreach(l, stmt->mergeActionList)
+ {
+ MergeAction *action = (MergeAction *) lfirst(l);
+ uint when_type = (action->matched ? 0 : 1);
+
+ /*
+ * Collect action types so we can check Target permissions
+ */
+ switch (action->commandType)
+ {
+ case CMD_INSERT:
+ {
+ InsertStmt *istmt = (InsertStmt *) action->stmt;
+ SelectStmt *selectStmt = (SelectStmt *) istmt->selectStmt;
+
+ /*
+ * The grammar allows attaching ORDER BY, LIMIT, FOR
+ * UPDATE, or WITH to a VALUES clause and also multiple
+ * VALUES clauses. If we have any of those, ERROR.
+ */
+ if (selectStmt && (selectStmt->valuesLists == NIL ||
+ selectStmt->sortClause != NIL ||
+ selectStmt->limitOffset != NULL ||
+ selectStmt->limitCount != NULL ||
+ selectStmt->lockingClause != NIL ||
+ selectStmt->withClause != NULL))
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("SELECT not allowed in MERGE INSERT statement")));

+ if (selectStmt && list_length(selectStmt->valuesLists) > 1)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("Multiple VALUES clauses not allowed in MERGE INSERT statement")));

Shouldn't this include an error position?

+ /*
+ * Construct a query of the form
+ * SELECT relation.ctid --junk attribute
+ * ,relation.tableoid --junk attribute
+ * ,source_relation.<somecols>
+ * ,relation.<somecols>
+ * FROM relation RIGHT JOIN source_relation
+ * ON join_condition; -- no WHERE clause - all conditions are applied in
+ * executor
+ *
+ * stmt->relation is the target relation, given as a RangeVar
+ * stmt->source_relation is a RangeVar or subquery
+ *
+ * We specify the join as a RIGHT JOIN as a simple way of forcing the
+ * first (larg) RTE to refer to the target table.
+ *
+ * The MERGE query's join can be tuned in some cases, see below for these
+ * special case tweaks.
+ *
+ * We set QSRC_PARSER to show query constructed in parse analysis
+ *
+ * Note that we have only one Query for a MERGE statement and the planner
+ * is called only once. That query is executed once to produce our stream
+ * of candidate change rows, so the query must contain all of the columns
+ * required by each of the targetlist or conditions for each action.
+ *
+ * As top-level statements INSERT, UPDATE and DELETE have a Query, whereas
+ * with MERGE the individual actions do not require separate planning,
+ * only different handling in the executor. See nodeModifyTable handling
+ * of commandType CMD_MERGE.
+ *
+ * A sub-query can include the Target, but otherwise the sub-query cannot
+ * reference the outermost Target table at all.
+ */
+ qry->querySource = QSRC_PARSER;

Why is this, and not building a proper executor node for merge that
knows how to get the tuples, the right approach? We did a rough
equivalent for matview updates, and I think it turned out to be a pretty
bad plan.

+ /*
+ * XXX MERGE is unsupported in various cases
+ */
+ if (!(pstate->p_target_relation->rd_rel->relkind == RELKIND_RELATION ||
+ pstate->p_target_relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("MERGE is not supported for this relation type")));

Shouldn't this report what relation type the error talking about? It's
not going to necessarily be obvious to the user. Also, errposition etc
would be good.

+ if (pstate->p_target_relation->rd_rel->relkind != RELKIND_PARTITIONED_TABLE &&
+ pstate->p_target_relation->rd_rel->relhassubclass)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("MERGE is not supported for relations with inheritance")));

Hm.

+ if (pstate->p_target_relation->rd_rel->relhasrules)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("MERGE is not supported for relations with rules")));

I guess we can add that later, but it's a bit sad that this won't work
against views with INSTEAD OF triggers.

+ foreach(l, stmt->mergeActionList)
+ {
+ MergeAction *action = (MergeAction *) lfirst(l);
+
+ /*
+ * Set namespace for the specific action. This must be done before
+ * analyzing the WHEN quals and the action targetlisst.
+ */
+ setNamespaceForMergeAction(pstate, action);
+
+ /*
+ * Transform the when condition.
+ *
+ * Note that these quals are NOT added to the join quals; instead they
+ * are evaluated separately during execution to decide which of the
+ * WHEN MATCHED or WHEN NOT MATCHED actions to execute.
+ */
+ action->qual = transformWhereClause(pstate, action->condition,
+ EXPR_KIND_MERGE_WHEN_AND, "WHEN");
+
+ /*
+ * Transform target lists for each INSERT and UPDATE action stmt
+ */
+ switch (action->commandType)
+ {
+ case CMD_INSERT:
+ {
+ InsertStmt *istmt = (InsertStmt *) action->stmt;
+ SelectStmt *selectStmt = (SelectStmt *) istmt->selectStmt;
+ List *exprList = NIL;
+ ListCell *lc;
+ RangeTblEntry *rte;
+ ListCell *icols;
+ ListCell *attnos;
+ List *icolumns;
+ List *attrnos;
+
+ pstate->p_is_insert = true;
+
+ icolumns = checkInsertTargets(pstate, istmt->cols, &attrnos);
+ Assert(list_length(icolumns) == list_length(attrnos));
+
+ /*
+ * Handle INSERT much like in transformInsertStmt
+ */
+ if (selectStmt == NULL)
+ {
+ /*
+ * We have INSERT ... DEFAULT VALUES. We can handle
+ * this case by emitting an empty targetlist --- all
+ * columns will be defaulted when the planner expands
+ * the targetlist.
+ */
+ exprList = NIL;
+ }
+ else
+ {
+ /*
+ * Process INSERT ... VALUES with a single VALUES
+ * sublist. We treat this case separately for
+ * efficiency. The sublist is just computed directly
+ * as the Query's targetlist, with no VALUES RTE. So
+ * it works just like a SELECT without any FROM.
+ */
+ List *valuesLists = selectStmt->valuesLists;
+
+ Assert(list_length(valuesLists) == 1);
+ Assert(selectStmt->intoClause == NULL);
+
+ /*
+ * Do basic expression transformation (same as a ROW()
+ * expr, but allow SetToDefault at top level)
+ */
+ exprList = transformExpressionList(pstate,
+ (List *) linitial(valuesLists),
+ EXPR_KIND_VALUES_SINGLE,
+ true);
+
+ /* Prepare row for assignment to target table */
+ exprList = transformInsertRow(pstate, exprList,
+ istmt->cols,
+ icolumns, attrnos,
+ false);
+ }

Can't we handle this with a littlebit less code duplication?

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4c0256b18a4..608f50b0616 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -53,23 +53,34 @@ typedef enum LockTupleMode
* When heap_update, heap_delete, or heap_lock_tuple fail because the target
* tuple is already outdated, they fill in this struct to provide information
* to the caller about what happened.
+ *
+ * result is the result of HeapTupleSatisfiesUpdate, leading to the failure.
+ * It's set to HeapTupleMayBeUpdated when there is no failure.
+ *
* ctid is the target's ctid link: it is the same as the target's TID if the
* target was deleted, or the location of the replacement tuple if the target
* was updated.
+ *
* xmax is the outdating transaction's XID. If the caller wants to visit the
* replacement tuple, it must check that this matches before believing the
* replacement is really a match.
+ *
* cmax is the outdating command's CID, but only when the failure code is
* HeapTupleSelfUpdated (i.e., something in the current transaction outdated
* the tuple); otherwise cmax is zero. (We make this restriction because
* HeapTupleHeaderGetCmax doesn't work for tuples outdated in other
* transactions.)
+ *
+ * lockmode is only relevant for callers of heap_update() and is the mode which
+ * the caller should use in case it needs to lock the updated tuple.
*/
typedef struct HeapUpdateFailureData
{
+ HTSU_Result result;
ItemPointerData ctid;
TransactionId xmax;
CommandId cmax;
+ LockTupleMode lockmode;
} HeapUpdateFailureData;

These new fields seem not really relateto HUFD, but rather just fields
the merge code should maintain?

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2018-04-03 02:40:12 Re: [HACKERS] MERGE SQL Statement for PG11
Previous Message Thomas Munro 2018-04-03 01:29:28 Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS