Re: [HACKERS] MERGE SQL Statement for PG11

From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Peter Geoghegan <pg(at)bowt(dot)ie>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Pavan Deolasee <pavan(dot)deolasee(at)2ndquadrant(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] MERGE SQL Statement for PG11
Date: 2018-04-05 06:01:48
Message-ID: CABOikdMrWyEvttnzzG-naGAHt+00d=ZuNcGB8tnEB_fKEG+NvQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Apr 3, 2018 at 7:48 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:

>
>
> @@ -3828,8 +3846,14 @@ struct AfterTriggersTableData
> bool before_trig_done; /* did we already queue BS triggers? */
> bool after_trig_done; /* did we already queue AS triggers? */
> AfterTriggerEventList after_trig_events; /* if so, saved list
> pointer */
> - Tuplestorestate *old_tuplestore; /* "old" transition table, if any
> */
> - Tuplestorestate *new_tuplestore; /* "new" transition table, if any
> */
> + /* "old" transition table for UPDATE, if any */
> + Tuplestorestate *old_upd_tuplestore;
> + /* "new" transition table for UPDATE, if any */
> + Tuplestorestate *new_upd_tuplestore;
> + /* "old" transition table for DELETE, if any */
> + Tuplestorestate *old_del_tuplestore;
> + /* "new" transition table INSERT, if any */
> + Tuplestorestate *new_ins_tuplestore;
> };
>
> A comment somewhere why we have all of these (presumably because they
> can now happen in the context of a single statement) would be good.
>
>
Done.

>
> @@ -5744,12 +5796,28 @@ AfterTriggerSaveEvent(EState *estate,
> ResultRelInfo *relinfo,
> newtup == NULL));
>
> if (oldtup != NULL &&
> - ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
> - (event == TRIGGER_EVENT_UPDATE && update_old_table)))
> + (event == TRIGGER_EVENT_DELETE && delete_old_table))
> {
> Tuplestorestate *old_tuplestore;
>
> - old_tuplestore = transition_capture->tcs_privat
> e->old_tuplestore;
> + old_tuplestore = transition_capture->tcs_privat
> e->old_del_tuplestore;
> +
> + if (map != NULL)
> + {
> + HeapTuple converted = do_convert_tuple(oldtup, map);
> +
> + tuplestore_puttuple(old_tuplestore, converted);
> + pfree(converted);
> + }
> + else
> + tuplestore_puttuple(old_tuplestore, oldtup);
> + }
>
> Very similar code is now repeated four times. Could you abstract this
> into a separate function?
>
>
Ok. I gave it a try, please check. It's definitely a lot lesser code.

>
>
> --- a/src/backend/executor/README
> +++ b/src/backend/executor/README
> @@ -37,6 +37,16 @@ the plan tree returns the computed tuples to be
> updated, plus a "junk"
> one. For DELETE, the plan tree need only deliver a CTID column, and the
> ModifyTable node visits each of those rows and marks the row deleted.
>
> +MERGE runs one generic plan that returns candidate target rows. Each row
> +consists of a super-row that contains all the columns needed by any of the
> +individual actions, plus a CTID and a TABLEOID junk columns. The CTID
> column is
>
> "plus a CTID and a TABLEOID junk columns" sounds a bit awkward?
>

Changed.

>
> +required to know if a matching target row was found or not and the
> TABLEOID
> +column is needed to find the underlying target partition, in case when the
> +target table is a partition table. If the CTID column is set we attempt to
> +activate WHEN MATCHED actions, or if it is NULL then we will attempt to
>
> "if it is NULL then" sounds wrong.
>

Made some adjustments.

>
> +activate WHEN NOT MATCHED actions. Once we know which action is activated
> we
> +form the final result row and apply only those changes.
> +
> XXX a great deal more documentation needs to be written here...
>
> Are we using tableoids in a similar way in other places?
>
>
AFAIK, no.

>
>
> +/*
> + * Given OID of the partition leaf, return the index of the leaf in the
> + * partition hierarchy.
> + */
> +int
> +ExecFindPartitionByOid(PartitionTupleRouting *proute, Oid partoid)
> +{
> + int i;
> +
> + for (i = 0; i < proute->num_partitions; i++)
> + {
> + if (proute->partition_oids[i] == partoid)
> + break;
> + }
> +
> + Assert(i < proute->num_partitions);
> + return i;
> +}
>
> Shouldn't we at least warn in a comment that this is O(N)? And document
> that it does weird stuff if the OID isn't found?
>

Yeah, added a comment. Also added a ereport(ERROR) if we do not find the
partition. There was already an Assert, but may be ERROR is better.

>
> Perhaps just introduce a PARTOID syscache?
>
>
Probably as a separate patch. Anything more than a handful partitions is
anyways known to be too slow and I doubt this code will add anything
material impact to that.

>
>
> diff --git a/src/backend/executor/nodeMerge.c
> b/src/backend/executor/nodeMerge.c
> new file mode 100644
> index 00000000000..0e0d0795d4d
> --- /dev/null
> +++ b/src/backend/executor/nodeMerge.c
> @@ -0,0 +1,575 @@
> +/*---------------------------------------------------------
> ----------------
> + *
> + * nodeMerge.c
> + * routines to handle Merge nodes relating to the MERGE command
>
> Isn't this file misnamed and it should be execMerge.c? The rest of the
> node*.c files are for nodes that are invoked via execProcnode /
> ExecProcNode(). This isn't an actual executor node.
>

Makes sense. Done. (Now that the patch is committed, I don't know if there
would be a rethink about changing file names. May be not, just raising that
concern)

>
> Might also be worthwhile to move the merge related init stuff here?
>

Done.

>
>
> What's the reasoning behind making this be an anomaluous type of
> executor node?
>
>
Didn't quite get that. I think naming of the file was bad (fixed now), but
I think it's a good idea to move the new code in a new file, from
maintainability as well as coverage perspective. If you've something very
different in mind, can you explain in more details?

>
>
> FWIW, I'd re-order this file so this routine is above
> ExecMergeMatched(), ExecMergeNotMatched(), easier to understand.
>

Done.

>
>
> + /*
> + * If there are not WHEN MATCHED actions, we are done.
> + */
> + if (mergeMatchedActionStates == NIL)
> + return true;
>
> Maybe I'm confused, but why is mergeMatchedActionStates attached to
> per-partition info? How can this differ in number between partitions,
> requiring us to re-check it below fetching the partition info above?
>
>
Because each partition may have a columns in different order, dropped
attributes etc. So we need to give treatment to the quals/targetlists. See
ON CONFLICT DO UPDATE for similar code.

>
> + /*
> + * Check for any concurrent update/delete operation which may have
> + * prevented our update/delete. We also check for situations where
> we
> + * might be trying to update/delete the same tuple twice.
> + */
> + if ((action->commandType == CMD_UPDATE && !tuple_updated) ||
> + (action->commandType == CMD_DELETE && !tuple_deleted))
> +
> + {
> + switch (hufd.result)
> + {
> + case HeapTupleMayBeUpdated:
> + break;
> + case HeapTupleInvisible:
> +
> + /*
> + * This state should never be reached since the
> underlying
> + * JOIN runs with a MVCC snapshot and should only
> return
> + * rows visible to us.
> + */
>
> Given EPQ, that reasoning isn't correct. I think this should still be
> unreachable, just not for the reason described here.
>
>
Agree. Updated the comment, but please check if it's satisfactory or you
would like to say something more/different.

>
>
> It seems fairly bad architecturally to me that we now have
> EvalPlanQual() loops in both this routine *and*
> ExecUpdate()/ExecDelete().
>
>
This was done after review by Peter and I think I like the new way too.
Also keeps the regular UPDATE/DELETE code paths least changed and let Merge
handle concurrency issues specific to it.

>
> +
> +/*
> + * Execute the first qualifying NOT MATCHED action.
> + */
> +static void
> +ExecMergeNotMatched(ModifyTableState *mtstate, EState *estate,
> + TupleTableSlot *slot)
> +{
> + PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
> + ExprContext *econtext = mtstate->ps.ps_ExprContext;
> + List *mergeNotMatchedActionStates = NIL;
> + ResultRelInfo *resultRelInfo;
> + ListCell *l;
> + TupleTableSlot *myslot;
> +
> + /*
> + * We are dealing with NOT MATCHED tuple. Since for MERGE, partition
> tree
>
> *the partition tree
>

Fixed.

>
>
> + * is not expanded for the result relation, we continue to work with
> the
> + * currently active result relation, which should be of the root of the
> + * partition tree.
> + */
> + resultRelInfo = mtstate->resultRelInfo;
>
> "should be"? "is", I hope? Given that it's referencing
> mtstate->resultRelInfo which is only set in one place...
>
>
Yeah, fixed.

>
>
> +/* flags for mt_merge_subcommands */
> +#define MERGE_INSERT 0x01
> +#define MERGE_UPDATE 0x02
> +#define MERGE_DELETE 0x04
>
> Hm, it seems a bit weird to define these file-local, given there's
> another file implementing a good chunk of merge.
>

Ok. Moved them execMerge.h, which made sense after I moved the
initialisation related code to execMerge.c

>
> + /*
> + * Initialize hufdp. Since the caller is only interested in the failure
> + * status, initialize with the state that is used to indicate
> successful
> + * operation.
> + */
> + if (hufdp)
> + hufdp->result = HeapTupleMayBeUpdated;
> +
>
> This signals for me that the addition of the result field to HUFD wasn't
> architecturally the right thing. HUFD is supposed to be supposed to be
> returned by heap_update(), reusing and setting it from other places
> seems like a layering violation to me.
>
>
I am not sure I agree. Sure we can keep adding more parameters to
ExecUpdate/ExecDelete and such routines, but I thought passing back all
information via a single struct makes more sense.

>
>
> diff --git a/src/backend/optimizer/plan/setrefs.c
> b/src/backend/optimizer/plan/setrefs.c
> index 69dd327f0c9..cd540a0df5b 100644
> --- a/src/backend/optimizer/plan/setrefs.c
> +++ b/src/backend/optimizer/plan/setrefs.c
> @@ -851,6 +851,60 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int
> rtoffset)
> fix_scan_list(root, splan->exclRelTlist, rtoffset);
> }
>
> + /*
> + * The MERGE produces the target rows by performing a right
>
> s/the MERGE/the MERGE statement/?
>
>
Fixed.

>
> + * join between the target relation and the source relation
> + * (which could be a plain relation or a subquery). The
> INSERT
> + * and UPDATE actions of the MERGE requires access to the
>
> same.
>
>
Fixed.

>
>
> +opt_and_condition:
>
> Renaming this to be a bit more merge specific seems like a good idea.
>

Renamed to opt_merge_when_and_condition

>
> +
> + /*
> + * Add a whole-row-Var entry to support references to "source.*".
> + */
> + var = makeWholeRowVar(rt_rte, rt_rtindex, 0, false);
> + te = makeTargetEntry((Expr *) var, list_length(*mergeSourceTargetList)
> + 1,
> + NULL, true);
>
> Why can't this properly dealt with by transformWholeRowRef() etc?
>
>
I just followed ON CONFLICT style. May be there is a better way, but not
clear how.

>
>
> + if (selectStmt && (selectStmt->valuesLists == NIL ||
> + selectStmt->sortClause != NIL ||
> + selectStmt->limitOffset != NULL ||
> + selectStmt->limitCount != NULL ||
> + selectStmt->lockingClause != NIL ||
> + selectStmt->withClause != NULL))
> + ereport(ERROR,
> + (errcode(ERRCODE_SYNTAX_ERROR),
> + errmsg("SELECT not allowed in MERGE
> INSERT statement")));
>
> + if (selectStmt && list_length(selectStmt->valuesLists)
> > 1)
> + ereport(ERROR,
> + (errcode(ERRCODE_SYNTAX_ERROR),
> + errmsg("Multiple VALUES clauses not
> allowed in MERGE INSERT statement")));
>
> Shouldn't this include an error position?
>
>
I will work on this as a separate follow-up patch. I tried adding location
to MergeAction, but that alone is probably not sufficient. So it was
turning out a slightly bigger patch than I anticipated.

>
> Why is this, and not building a proper executor node for merge that
> knows how to get the tuples, the right approach? We did a rough
> equivalent for matview updates, and I think it turned out to be a pretty
> bad plan.
>
>
I am still not sure why that would be any better. Can you explain in detail
what exactly you've in mind and how's that significantly better than what
we have today?

>
>
> + /*
> + * XXX MERGE is unsupported in various cases
> + */
> + if (!(pstate->p_target_relation->rd_rel->relkind == RELKIND_RELATION
> ||
> + pstate->p_target_relation->rd_rel->relkind ==
> RELKIND_PARTITIONED_TABLE))
> + ereport(ERROR,
> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> + errmsg("MERGE is not supported for this relation type")));
>
> Shouldn't this report what relation type the error talking about? It's
> not going to necessarily be obvious to the user. Also, errposition etc
> would be good.
>
>
Hmm, ok. There is getRelationTypeDescription() but otherwise I am surprised
that I couldn't find a ready way to get string representation of relation
kind. Am I missing something? Of course, we can write for the purpose, but
wanted to ensure we are not duplicating something already available.

>
>
> + /*
> + * Do basic expression transformation (same as a
> ROW()
> + * expr, but allow SetToDefault at top level)
> + */
> + exprList = transformExpressionList(pstate,
> + (List *)
> linitial(valuesLists),
> +
> EXPR_KIND_VALUES_SINGLE,
> + true);
> +
> + /* Prepare row for assignment to target table */
> + exprList = transformInsertRow(pstate, exprList,
> + istmt->cols,
> + icolumns, attrnos,
> + false);
> + }
>
> Can't we handle this with a littlebit less code duplication?
>

Hmm, yeah, that will be good. But given Tom's suggestions on the other
thread, I would like to postpone any refactoring here.

>
> typedef struct HeapUpdateFailureData
> {
> + HTSU_Result result;
> ItemPointerData ctid;
> TransactionId xmax;
> CommandId cmax;
> + LockTupleMode lockmode;
> } HeapUpdateFailureData;
>
>
> These new fields seem not really relateto HUFD, but rather just fields
> the merge code should maintain?
>

As I said, we can possibly track those separately. But they all arise as a
result of update/delete/lock failure. So I would prefer to keep them along
with other fields such as ctid and xmax/cmax. But if you or others insist,
I can move them and pass in/out of various function calls where we need to
maintain those.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
0001-Take-care-of-Andres-s-comments.patch application/octet-stream 35.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Etsuro Fujita 2018-04-05 06:02:08 Re: [HACKERS] Add support for tuple routing to foreign partitions
Previous Message Masahiko Sawada 2018-04-05 05:40:29 Re: [HACKERS] GUC for cleanup indexes threshold.