Re: MERGE SQL Statement

From: Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: MERGE SQL Statement
Date: 2008-04-17 02:18:53
Message-ID: 200804162218.53963.xzilla@users.sourceforge.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wednesday 16 April 2008 14:58, Simon Riggs wrote:
> I've analysed various flavours of MERGE command to understand and
> propose what we should use for PostgreSQL.
>
> The results aren't what you'd expect from a quick flick through the
> standard, so lets look at my main concerns:
>
> 1. The simplest syntax is for SQL:2003. The syntax for DB2, SQL Server
> and Oracle is more complex, with SQL:2008(final draft) being very
> similar to DB2 and SQL Server, so unlikely to be a point of contention
> in the standard. I suggest we go with the latter, but yes, its still in
> draft (yawn).
>
> 2. MySQL and Teradata have their own syntax for the row-oriented Upsert
> operation. Both of those are more useful (IMHO) than MERGE for OLTP
> apps, while MERGE is very useful for bulk data loads. I'm open to the
> idea that we do something like this in addition to MERGE.
>
> 3. The way MERGE works is to define a left outer join between source and
> target, then specify a series of WHEN clauses that may or may not apply.
> It **isn't** just a simple Update/Insert and so much of what we have
> discussed previously goes straight in the trash. AFAICS the way it is
> specified to work it would be fairly straightforward to cause race
> conditions and failures when using multiple concurrent MERGE statements.
>
> General Example of the recommended syntax for PostgreSQL
>
> MERGE INTO Stock S /* target */
>
> USING DailySales DS /* source table */
>
> ON S.Item = DS.Item /* left outer join source to target */
>
> WHEN MATCHED AND (QtyOnHand - QtySold = 0) THEN
>
> /* delete item if no stock remaining */
> DELETE
>
> WHEN MATCHED THEN /* No AND clause, so drop thru */
>
> /* update value if some remains */
> UPDATE SET QtyOnHand = QtyOnHand - QtySold
>
> WHEN NOT MATCHED THEN
>
> /* insert a row if the stock is new */
> INSERT VALUES (Item, QtySold)
> ;
>
> So lets look at the syntaxes and then review how it might work.
>
> SYNTAX
> ======
> SQL:2003
> --------
> MERGE INTO target [AS correlation-name]
> USING [table-ref | subquery]
> ON <search-condition>
> [WHEN MATCHED THEN MergeUpdate]
> [WHEN NOT MATCHED THEN MergeInsert]
>
> Oracle 11g
> ----------
> MERGE INTO target [AS correlation-name]
> USING [table-ref | subquery]
> ON <search-condition>
> [WHEN MATCHED THEN MergeUpdate
> WHERE <where-clause> DELETE WHERE <where-clause>]
> [WHEN NOT MATCHED THEN MergeInsert
> WHERE <where-clause>]
>
> Differences from SQL:2003 are
> * Update and Insert have WHERE clauses on them
> * Oracle allows multiple WHEN ... WHERE clauses
> * Oracle allows an error logging clause also
> * optional DELETE statement as part of the UPDATE, so you can only
> DELETE what you update (yeh, really)
> * WHEN MATCHED/WHEN NOT MATCHED must be in fixed order, only
>
> IBM DB2
> -------
> MERGE INTO target [AS correlation-name]
> USING [table-ref | subquery]
> ON <search-condition>
> [WHEN MATCHED [AND <where-clause>] THEN MergeUpdate | MergeDelete]
> [WHEN NOT MATCHED [AND <where-clause>] THEN MergeInsert | SignalClause]
> [ELSE IGNORE]
>
> Differences from SQL:2003 are
> * Update and Insert have AND clauses on them (like WHERE...)
> * DB2 allows multiple WHEN ... AND clauses
> * DELETE is also a full-strength option, not part of the MergeUpdate
> clause as it is in Oracle
> * DB2 allows a SIGNAL statement, similar to RAISE
> * ELSE IGNORE is an optional syntax, which does nothing
>
> SQL Server 2008
> ---------------
>
> MERGE [INTO] target [AS correlation-name]
> USING [table-ref | subquery]
> ON <search-condition>
> [WHEN MATCHED [AND <where-clause>] THEN MergeUpdate | MergeDelete]
> [WHEN NOT MATCHED [AND <where-clause>] THEN MergeInsert]
>
> Differences from SQL:2003 are
> * Update and Insert have AND clauses on them (like WHERE...)
> * DB2 allows multiple WHEN ... AND clauses
> * DELETE is also a full-strength option, not part of the MergeUpdate
> clause as it is in Oracle
>
> SQL:2008
> --------
>
> MERGE INTO target [AS correlation-name]
> USING [table-ref | subquery]
> ON <search-condition>
> [WHEN MATCHED [AND <where-clause>] THEN MergeUpdate]
> [WHEN NOT MATCHED [AND <where-clause>] THEN MergeInsert]
>
> Differences from SQL:2003 are
> * Update and Insert have AND clauses on them (like WHERE...)
> * Allows multiple WHEN ... AND clauses
>
> Alternate Syntax
> ----------------
>
> MySQL supports
> * REPLACE INTO
> * INSERT ... ON DUPLICATE KEY UPDATE ...
>
> Teradata supports
> * UPDATE ... ELSE INSERT ...
> * MERGE with an additional error logging clause
>
> The Teradata and Oracle error logging clauses are very cute and I
> suggest we do something similar for COPY, at least.
>
> Proposed Syntax for PostgreSQL
> ==============================
>
> MERGE INTO table [[AS] alias]
> USING [table-ref | query]
> ON join-condition
> [WHEN MATCHED [AND condition] THEN MergeUpdate | DELETE]
> [WHEN NOT MATCHED [AND condition] THEN MergeInsert]
>
> MergeUpdate is
> UPDATE SET { column = { expression | DEFAULT } |
> ( column [, ...] ) = ( { expression | DEFAULT } [, ...] ) }
> [, ...]
> (yes, there is no WHERE clause here)
>
> MergeInsert is
> INSERT [ ( column [, ...] ) ]
> { DEFAULT VALUES | VALUES ( { expression | DEFAULT } [, ...] )
> [, ...]}
> (no subquery allowed)
>
>
> Notes and behaviours
> --------------------
>
> * It is possible for concurrent MERGE statements to cause duplicate
> INSERT violations because of a race condition between when we check
> whether the row is matching/not matching and when we apply the
> appropriate WHEN clause, if any. This is just the same as what we do now
> with try-UPDATE-then-INSERT logic. (This seems to end the discussion
> about whether we do inserts/updates first because the matching test is
> always performed before we take the action; or perhaps it means we don't
> like MERGE as much as we did before and would prefer alternate
> syntaxes...). Maybe we could avoid some problems by applying
> heap_lock_tuple() to each matched row, so we know it will stay matched
> while we evaluate the WHEN clauses? Maybe not.
>

I would have thought that MVCC + the command taking a row lock on each tuple
that matched a update/delete case would protect us from many race
scenarios... at least as protected as we are now. ISTM you can't protect well
for INSERT operations.

> * USING query can be a VALUES clause if we wish to do single/few row
> operations, so MERGE can be used for bulk-loading and OLTP
>
> * There is no RETURNING option for MERGE, nor for any INSERT/UPDATE
> sub-clauses

Is there a reason for this? A returning for any insert/updated rows would be
great, especially if your doing single row merges via the values clause.

> * WHERE CURRENT OF cursor is not supported anywhere
> * The join can't be recursive, so no WITH support (common expressions,
> i.e. non-recursive WITH are supported by SQLServer 2008)
> * conditions would not allow sub-selects
>

same question on both of these, is there some technical reason for this? I'd
imagine people would like both of these options.

> * MERGE would work on base tables only, just like COPY
> * Changes are made only to that single table
> * Cannot update a column mentioned in the ON clause cos that would make
> my head hurt too much
>

Hmm... if the matching test is always done first, istm you could then update
the columns that matched on, since you don't really care about the value of
the on column once you have the row.

> * MERGE will perform a left outer join between source on left and target
> on right. There must be no more than 1 row from table-ref for each row
> in the table. Each row in the table can only be updated once during each
> MERGE statement. Each non-matching row in the table-ref will result in
> one INSERT into table.
>
> * WHEN clauses are tested in the order specified. If the AND condition
> returns false then we skip onto the next WHEN clause. We stop once a
> WHEN clause activates, so only *one* action is ever activated for each
> row.
>
> * AND clauses need not form a complete set, i.e. it is possible that no
> action will result. It is also possible that some WHEN clauses will
> never activate because of the execution order; we would not try to
> prevent this, just document it as a possible user error.
>

Just curious if any of these behaviors come from the spec? or maybe from
other databases? they don't seem unreasonable in general though.

> * MERGE will respect Triggers, but not Rules since the rules behaviour
> is roughly orthogonal to the WHEN clauses

Should there be a new rule option? ie. ON MERGE rules ?

> * MERGE fires UPDATE and INSERT triggers according to which WHEN clause
> is activated (if any)
>
> * It's unclear whether MERGE should activate statement-level triggers,
> or not. Few of the above sources are explicit either way on this point.
> DB2 always runs both UPDATE and INSERT statement-level triggers, whether
> or not rows have been changed; I would suggest we do that also for
> before triggers. For after statement triggers I would suggest that we
> track how many updates and inserts are caused and if updates > 0 then we
> activate the after statement for update triggers, and if inserts > 0
> then we activate the after statement for insert triggers. If a statement
> level trigger is activated by both update and insert then it would be
> possible for both TRIGGER_FIRED_BY_UPDATE() and
> TRIGGER_FIRED_BY_DELETE() to be true (for statement level triggers
> only), which would be a change from what we do now, even if the old
> behaviour was not explicitly mutually exclusive. In both cases I suggest
> we run Update triggers before Insert triggers consistently for both
> before and after statement triggers.
>

It would seem wierd that a before update statement level trigger would fire
and not the matching after update statement level trigger. I can probably
live with this behavior though, but seems like a note worth documenting.

> * The number of rows changed should be (inserts + updates) which should
> be < number of rows returned by table-ref. It would be good to get
> access to the number of rows inserted and updated, so I propose that we
> return a NOTICE statement with this information.
>

How do you handle deletes? IE. If I merge two tables and I end up with no
updates or inserts but 100 deletes, is the number of affected rows 0 ?

> * The way MERGE is specified, the internals design seems to fall out
> fairly clearly:
> - set up a left outer join between source and target, with a junk
> attribute that might be NULL to indicate NOT MATCHED. Let planner
> optimise that as it wishes, nothing special needed
> - in ExecutePlan have a new operation type of CMD_MERGE, which then
> allows us to apply the WHEN clauses to decide what kind of final
> operation will result for that tuple (U, D or I)
>
> I'm planning to agree the syntax and write regression tests first, so we
> all agree the behaviour we are aiming towards.
>
> Thoughts? (Apart from jokes about having a WHY clause...)
>

Nice work, hope my comments will be helpful.

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2008-04-17 02:49:00 Re: Proposed patch - psql wraps at window width
Previous Message Andrew Dunstan 2008-04-17 01:37:40 Re: How to submit a patch