Re: [HACKERS] MERGE SQL Statement for PG11

From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Pavan Deolasee <pavan(dot)deolasee(at)2ndquadrant(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] MERGE SQL Statement for PG11
Date: 2018-02-12 07:09:28
Message-ID: CABOikdN4us_ExRc=q3CUnvHbKUiRJt-wmQX-cLaysC=rihd0=w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Feb 9, 2018 at 6:53 AM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:

> On Wed, Feb 7, 2018 at 7:51 PM, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
> wrote:
> > I understand getting EPQ semantics right is very important. Can you
> please
> > (once again) summarise your thoughts on what you think is the *most*
> > appropriate behaviour? I can then think how much efforts might be
> involved
> > in that. If the efforts are disproportionately high, we can discuss if
> > settling for some not-so-nice semantics, like we apparently did for
> > partition key updates.
>
> I personally believe that the existing EPQ semantics are already
> not-so-nice. They're what we know, though, and we haven't actually had
> any real world complaints, AFAIK.
>

I agree.

>
> My concern is mostly just that MERGE manages to behave in a way that
> actually "provides a single SQL statement that can conditionally
> INSERT, UPDATE or DELETE rows, a task that would otherwise require
> multiple procedural language statements", as the docs put it. As long
> as MERGE manages to do something as close to that high level
> description as possible in READ COMMITTED mode (with our current
> semantics for multiple statements in RC taken as the baseline), then
> I'll probably be happy.
>

IMO it will be quite hard, if not impossible, to guarantee the same
semantics to a single statement MERGE and multi statement
UPDATE/DELETE/INSERT in RC mode. For example, the multi statement model
will execute each statement with a new MVCC snapshot and hence the rows
visible to individual statement may vary. Whereas in MERGE, everything runs
with a single snapshot. There could be other such subtle differences.

>
> Some novel new behavior -- "EPQ with a twist"-- is clearly necessary.
> I feel a bit uneasy about it because anything that anybody suggests is
> likely to be at least a bit arbitrary (EPQ itself is kind of
> arbitrary). We only get to make a decision on how "EPQ with a twist"
> will work once, and that should be a decision that is made following
> careful deliberation. Ambiguity is much more likely to kill a patch
> than a specific technical defect, at least in my experience. Somebody
> can usually just fix a technical defect.
>

While I agree, I think we need to make these decisions in a time bound
fashion. If there is too much ambiguity, then it's not a bad idea to settle
for throwing appropriate errors instead of providing semantically wrong
answers, even in some remote corner case.

>
>
>
> > TBH I did not consider partitioning any less complex and it was indeed
> very
> > complex, requiring at least 3 reworks by me. And from what I understood,
> it
> > would have been a blocker too. So is subquery handling and RLS. That's
> why I
> > focused on addressing those items while you and Simon were still debating
> > EPQ semantics.
>
> Sorry if I came across as dismissive of that effort. That was
> certainly not my intention. I am pleasantly surprised that you've
> managed to move a number of things forward rather quickly.
>
> I'll rephrase: while it would probably have been a blocker in theory
> (I didn't actually weigh in on that), I doubted that it would actually
> end up doing so in practice (and it now looks like I was right to
> doubt that, since you got it done). It was a theoretical blocker, as
> opposed to an open item that could drag on indefinitely despite
> everyone's best efforts. Obviously details matter, and obviously there
> are a lot of details to get right outside of RC semantics, but it
> seems wise to focus on the big risk that is EPQ/RC conflict handling.
>

Ok. I am now back from holidays and I will too start thinking about this.
I've also requested a colleague to help us with comparing it against
Oracle's behaviour. N That's not a gold standard for us, but knowing how
other major databases handle RC conflicts, is not a bad idea.

I see the following important areas and as long as we have a consistent and
coherent handling of these cases, we should not have difficulty agreeing on
a outcome.

1. Concurrent UPDATE does not affect MATCHED case. The WHEN conditions may
or may not be affected.
2. Concurrently UPDATEd tuple fails the join qual and the current source
tuple no longer matches with the updated target tuple that the EPQ is set
for. It matches no other target tuple either. So a MATCHED case is turned
into a NOT MATCHED case.
3. Concurrently UPDATEd tuple fails the join qual and the current source
tuple no longer matches with the updated target tuple that the EPQ is set
for. But it matches some other target tuple. So it's still a MATCHED case,
but with different target tuple(s).
4. Concurrent UPDATE/INSERT creates a matching target tuple for a source
tuple, thus turning a NOT MATCHED case to a MATCHED case.
5. Concurrent DELETE turns a MATCHED case into NOT MATCHED case

Any other case that I am missing? Assuming all cases are covered, what
should we do in each of these cases, so that there is no or very little
ambiguity and the outcome seems consistent (at the very least as far as
MERGE goes and hopefully with regular UPDATE/DELETE handling)

I think #1 is pretty straight forward. We should start from the top,
re-evaluate the WHEN conditions again and execute the first matching action.

For #2, it seems natural that we skip the MATCHED actions and execute the
NOT MATCHED action for the current source tuple. But the trouble is how to
differentiate between #2 and #3. I'm not sure if we can really distinguish
between these cases i.e. given the current source tuple, does another
matching target tuple exists? If another matching target tuple exists, we
must not invoke the NOT MATCHED action. Otherwise we might end up executing
NOT MATCHED as well as MATCHED action for the same source tuple, which
seems weird.

#4 looks similar to INSERT ON CONFLICT and one may argue that we should
detect concurrent inserts/updates and execute the MATCHED action. But I
don't know if that can be done in a reasonable way. It will probably
require us to have a primary key on the table to detect those conflicts. I
think we should just let the operation fail, like a regular INSERT.

It seems natural that #5 should skip the MATCHED action and instead execute
the first satisfying NOT MATCHED action. But it's outcome may depend on how
we handle #2 and #3 so that they are all consistent. If we allow #2 and #3
to error out, whenever there is ambiguity, we should do the same for #5.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nikhil Sontakke 2018-02-12 08:06:16 Re: [HACKERS] logical decoding of two-phase transactions
Previous Message Andrey Borodin 2018-02-12 06:46:31 [WiP] GiST intrapage indexing