Re: logical decoding of two-phase transactions

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>
Cc: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: logical decoding of two-phase transactions
Date: 2017-03-27 13:29:02
Message-ID: CAMsr+YGm-pft3CtxbzP0SOFj6rN62Loh43hj2oYjuYCVbDLXFg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 27 March 2017 at 17:53, Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> wrote:

> I’m heavily underestimated amount of changes there, but almost finished
> and will send updated patch in several hours.

Oh, brilliant! Please post whatever you have before you knock off for
the day anyway, even if it's just a WIP, so I can pick it up tomorrow
my time and poke at its tests etc.

I'm in Western Australia +0800 time, significantly ahead of you.

> Done.
[snip]
> Also done.

Great, time is short so that's fantastic.

> I’ve extended test, but it is good to have some more.

I don't mind writing tests and I've done quite a bit with TAP now, so
happy to help there.

>> Some special care is needed for the callback that decides whether to
>> process a given xact as 2PC or not. It's called before PREPARE
>> TRANSACTION to decide whether to decode any given xact at prepare time
>> or wait until it commits. It's called again at COMMIT PREPARED time if
>> we crashed after we processed PREPARE TRANSACTION and advanced our
>> confirmed_flush_lsn such that we won't re-process the PREPARE
>> TRANSACTION again. Our restart_lsn might've advanced past it so we
>> never even decode it, so we can't rely on seeing it at all. It has
>> access to the xid, gid and invalidations, all of which we have at both
>> prepare and commit time, to make its decision from. It must have the
>> same result at prepare and commit time for any given xact. We can
>> probably use a cache in the reorder buffer to avoid the 2nd call on
>> commit prepared if we haven't crashed/reconnected between the two.
>
> Good point. Didn’t think about restart_lsn in case when we are skipping this
> particular prepare (filter_prepared() -> true, in my terms). I think that should
> work properly as it use the same code path as it was before, but I’ll look at it.

I suspect that's going to be fragile in the face of interleaving of
xacts if we crash between prepare and commit prepared. (Apologies if
the below is long or disjointed, it's been a long day but trying to
sort thoughts out).

Consider ("SSU" = "standby status update"):

0/050 xid 1 BEGIN
0/060 xid 1 INSERT ...

0/070 xid 2 BEGIN
0/080 xid 2 INSERT ...

0/090 xid 3 BEGIN
0/095 xid 3 INSERT ...
0/100 xid 3 PREPARE TRANSACTION 'x' => sent to client [y/n]?
SSU: confirmed_flush_lsn = 0/100, restart_lsn 0/050 (if we sent to client)

0/200 xid 2 COMMIT => sent to client
SSU: confirmed_flush_lsn = 0/200, restart_lsn 0/050

0/250 xl_running_xacts logged, xids = [1,3]

[CRASH or disconnect/reconnect]

Restart decoding at 0/050.

skip output of xid 3 PREPARE TRANSACTION @ 0/100: is <= confirmed_flush_lsn
skip output of xid 2 COMMIT @ 0/200: is <= confirmed_flush_lsn

0/300 xid 2 COMMIT PREPARED 'x' => sent to client, confirmed_flush_lsn
is > confirmed_flush_lsn

In the above, our problem is that restart_lsn is held down by some
other xact, so we can't rely on it to tell us if we replayed xid 3 to
the output plugin or not. We can't use confirmed_flush_lsn either,
since it'll advance at xid 2's commit whether or not we replayed xid
3's prepare to the client.

Since xid 3 will still be in xl_running_xacts when prepared, when we
recover SnapBuildProcessChange will return true for its changes and
we'll (re)buffer them, whether or not we landed up sending to the
client at prepare time. Nothing much to be done about that, we'll just
discard them when we process the prepare or the commit prepared,
depending on where we consult our filter callback again.

We MUST ask our filter callback again though, before we test
SnapBuildXactNeedsSkip when processing the PREPARE TRANSACTION again.
Otherwise we'll discard the buffered changes, and if we *didn't* send
them to the client already ... splat.

We can call the filter callback again on xid 3's prepare to find out
"would you have replayed it when we passed it last time". Or we can
call it when we get to the commit instead, to ask "when called last
time at prepare, did you replay or not?" But we have to consult the
callback. By default we'd just skip ReorderBufferCommit processing for
xid 3 entirely, which we'll do via the SnapBuildXactNeedsSkip call in
DecodeCommit when we process the COMMIT PREPARED.

If there was no other running xact when we decoded the PREPARE
TRANSACTION the first time around (i.e. xid 1 and 2 didn't exist in
the above), and if we do send it to the client at prepare time, I
think we can safely advance restart_lsn to the most recent
xl_running_xacts once we get replay confirmation. So we can pretend we
already committed at PREPARE TRANSACTION time for restart purposes if
we output at PREPARE TRANSACTION time, it just doesn't help us with
deciding whether to send the buffer contents at COMMIT PREPARED time
or not.

TL;DR: we can't rely on restart_lsn or confirmed_flush_lsn or
xl_running_xacts, we must ask the filter callback when we (re)decode
the PREPARE TRANSACTION record and/or at COMMIT PREPARED time.

This isn't a big deal. We just have to make sure we consult the filter
callback again when we decode an already-confirmed prepare
transaction, or at commit prepared time if we don't know what its
result was already.

>> This proposal does not provide a way to safely decode a 2pc xact that
>> made catalog changes which may be aborted while being decoded. The
>> plugin must lock such an xact so that it can't be aborted while being
>> processed, or defer decoding until commit prepared. It can use the
>> invalidations for the commit to decide.
>
> I had played with that two-pass catalog scan and it seems to be
> working but after some time I realised that it is not useful for the main
> case when commit/abort is generated after receiver side will answer to
> prepares. Also that two-pass scan is a massive change in relcache.c and
> genam.c (FWIW there were no problems with cache, but some problems
> with index scan and handling one-to-many queries to catalog, e.g. table
> with it fields)

Yeah, it was the intrusiveness I was concerned about. I don't think we
can even remotely hope to do that for Pg 10.

> Finally i decided to throw it and switched to filter_prepare callback and
> passed there txn structure to allow access to has_catalog_changes field.

I think that's how we'll need to go.

Plugins can either defer processing on all 2pc xacts with catalog
changes, or lock the xact. It's not perfect, but it's far from
unreasonable when you consider that plugins would only be locking 2pc
xacts where they expect the result of logical decoding to influence
the commit/abort decision, so we won't be doing a commit/abort until
we finish decoding the prepare anyway.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2017-03-27 13:33:43 Re: WIP: Faster Expression Processing v4
Previous Message Daniel Verite 2017-03-27 13:28:51 Re: PATCH: Batch/pipelining support for libpq