Re: [HACKERS] logical decoding of two-phase transactions

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Nikhil Sontakke <nikhils(at)2ndquadrant(dot)com>
Cc: Sokolov Yura <y(dot)sokolov(at)postgrespro(dot)ru>, Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>, Andres Freund <andres(at)anarazel(dot)de>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: [HACKERS] logical decoding of two-phase transactions
Date: 2017-11-24 03:05:26
Message-ID: CAMsr+YGX1z5t_MccrJxd4VJPgZKx17EsAQcAQ_1iJ4dKDp0LOg@mail.gmail.com
Lists: pgsql-hackers

On 23 November 2017 at 20:27, Nikhil Sontakke <nikhils(at)2ndquadrant(dot)com>
wrote:

>
> Is there any other locking scenario that we need to consider?
> Otherwise, are we all ok on this point being a non-issue for 2PC
> logical decoding?
>

Yeah.

I didn't find any sort of sensible situation where locking would pose
issues. Unless you're taking explicit LOCKs on catalog tables, you should
be fine.

There may be issues with CLUSTER or VACUUM FULL of non-relmapped catalog
relations I guess. Personally I put that in the "don't do that" box, but if
we really have to guard against it we could slightly expand the limits on
which txns you can PREPARE to any txn that has a strong lock on a catalog
relation.

> The issue is with a concurrent rollback of the prepared transaction.
> We need a way to ensure that
> the 2PC does not abort when we are in the midst of a change record
> apply activity.
>

The *reason* we care about this is that tuples created by aborted txns are
not considered "recently dead" by vacuum. They can be marked invalid and
removed immediately due to hint-bit setting and HOT pruning, vacuum runs,
etc.

This could create an inconsistent view of the catalogs if our prepared txn
did any DDL. For example, we might've updated a pg_class row, so we created
a new row and set xmax on the old one. Vacuum may merrily remove our new
row, so there's no way we can find the correct data anymore; we'd have to
find the outdated row or no row. By my reading of HeapTupleSatisfiesMVCC
we'll see the old pg_class tuple.

Similar issues apply for pg_attribute etc etc. We might try to decode a
record according to the wrong table structure because relcache lookups
performed by the plugin will report outdated information.

The sanest option here seems to be to stop the txn from aborting while
we're decoding it, hence Nikhil's suggestions.

> * We introduce two new booleans in the TwoPhaseState
> GlobalTransactionData structure.
> bool beingdecoded;
> bool abortpending;
>

I think it's premature to rule out the simple option of doing a LockGXact
when we start decoding. Improve the error "prepared transaction with
identifier \"%s\" is busy" to report the locking pid too. It means you
cannot roll back or commit a prepared xact while it's being decoded, but for
the intended use of this patch, I think that's absolutely fine anyway.

But I like your suggestion much more than the idea of taking a LWLock on
TwoPhaseStateLock while decoding a record. Lots of LWLock churn, and
LWLocks held over arbitrary user plugin code. Not viable.

With your way we just have to take a LWLock once on TwoPhaseState when we
test abortpending and set beingdecoded. After that, during decoding, we can
do unlocked tests of abortpending, since a stale read will do nothing worse
than delay our response a little. The existing 2PC ops already take the
LWLock and can examine beingdecoded then. I expect they'd need to WaitLatch
in a loop until beingdecoded was cleared, re-acquiring the LWLock and
re-checking each time it's woken. We should probably add a field there for
a waiter proc that wants its latch set, so 2pc ops don't usually have to
poll for decoding to finish. (Unless condition variables will help us here?)
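
Roughly, I'm picturing something like the below, written as if it lived in
twophase.c. Only the beingdecoded/abortpending fields come from your
proposal; the function names, the latch timeout and the wait event are
placeholders I've made up to show the locking pattern, not what the patch
has to do:

#include "postgres.h"

#include "access/twophase.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/latch.h"
#include "storage/lwlock.h"

/*
 * Called once, before we decode the first change of a prepared xact.
 * Returns false if an abort is already pending, in which case we skip
 * decoding entirely.
 */
static bool
BeginDecodePreparedXact(GlobalTransaction gxact)
{
    bool        ok;

    LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
    ok = !gxact->abortpending;
    if (ok)
        gxact->beingdecoded = true;
    LWLockRelease(TwoPhaseStateLock);

    return ok;
}

/*
 * Checked before each change record. An unlocked read is fine here; a
 * stale value only delays our reaction by one change record.
 */
static bool
ContinueDecodePreparedXact(GlobalTransaction gxact)
{
    return !gxact->abortpending;
}

/*
 * COMMIT/ROLLBACK PREPARED side: flag the abort (for rollback), then wait
 * until the decoding backend clears beingdecoded. A waiter-proc field and
 * latch wakeup, or a condition variable, could replace the timeout poll.
 */
static void
WaitForPreparedXactDecode(GlobalTransaction gxact, bool isAbort)
{
    for (;;)
    {
        bool        busy;

        LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
        if (isAbort)
            gxact->abortpending = true;
        busy = gxact->beingdecoded;
        LWLockRelease(TwoPhaseStateLock);

        if (!busy)
            break;

        (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT, 100L,
                         PG_WAIT_EXTENSION);
        ResetLatch(MyLatch);
        CHECK_FOR_INTERRUPTS();
    }
}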

However, let me make an entirely alternative suggestion. Should we add a
heavyweight lock class for 2PC xacts instead, and leverage the existing
infrastructure? We already use transaction locks widely after all. That
way, we just take some kind of share lock on the 2PC xact, keyed by xid, when
we start decoding it. ROLLBACK PREPARED and COMMIT
PREPARED would acquire the same heavyweight lock in an exclusive mode
before grabbing TwoPhaseStateLock and doing their work.

That way we get automatic cleanup when decoding procs exit, wakeups for
waiters, etc., all for "free".

How practical is adding a lock class?

(Frankly I've often wished I could add new heavyweight lock classes when
working on complex extensions like BDR, too, and in an ideal world we'd be
able to register lock classes for use by extensions...)
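
To make the shape of that a bit more concrete, something along these lines is
what I have in mind. LOCKTAG_TWOPHASE and the two helper functions are
invented for illustration; the LOCKTAG setup and the LockAcquire() calls are
the existing lmgr machinery:

#include "postgres.h"

#include "storage/lock.h"

/* Hypothetical new lock tag type, keyed by the prepared xact's xid. */
#define SET_LOCKTAG_TWOPHASE(tag, xid) \
    ((tag).locktag_field1 = (xid), \
     (tag).locktag_field2 = 0, \
     (tag).locktag_field3 = 0, \
     (tag).locktag_field4 = 0, \
     (tag).locktag_type = LOCKTAG_TWOPHASE, \
     (tag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)

/*
 * Logical decoding takes the lock in a share mode for the duration of
 * decoding the prepared xact.
 */
static void
LockPreparedXactForDecode(TransactionId xid)
{
    LOCKTAG     tag;

    SET_LOCKTAG_TWOPHASE(tag, xid);
    (void) LockAcquire(&tag, ShareLock, false, false);
}

/*
 * COMMIT/ROLLBACK PREPARED take it exclusively before grabbing
 * TwoPhaseStateLock, so they queue behind any decoding backend and get
 * woken when it releases the lock or exits.
 */
static void
LockPreparedXactForFinish(TransactionId xid)
{
    LOCKTAG     tag;

    SET_LOCKTAG_TWOPHASE(tag, xid);
    (void) LockAcquire(&tag, AccessExclusiveLock, false, false);
}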

> 3) Before starting decode of the next change record, we re-check if
> "abortpending" is set. If "abortpending"
> is set, we do not decode the next change record. Thus the abort is
> delay-bounded to a maximum of one change record decoding/apply cycle
> after we signal our intent to abort it. Then, we need to send ABORT
> (regular, not rollback prepared, since we have not sent "PREPARE" yet).
>

Just to be explicit, this means "tell the downstream the xact has aborted".
Currently logical decoding does not ever start decoding an xact until it's
committed, so it has never needed an abort callback on the output plugin
interface.

But we'll need one when we start doing speculative logical decoding of big
txns before they commit, and we'll need it for this. It's relatively
trivial.
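
Something with the same shape as the existing commit callback should be all
we need; roughly (the callback name and exact struct placement are my
guesses, not something the patch has settled on):

/* in replication/output_plugin.h */
typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
                                      ReorderBufferTXN *txn,
                                      XLogRecPtr abort_lsn);

typedef struct OutputPluginCallbacks
{
    LogicalDecodeStartupCB startup_cb;
    LogicalDecodeBeginCB begin_cb;
    LogicalDecodeChangeCB change_cb;
    LogicalDecodeCommitCB commit_cb;
    LogicalDecodeAbortCB abort_cb;      /* new: xact we were streaming went away */
    LogicalDecodeMessageCB message_cb;
    LogicalDecodeFilterByOriginCB filter_by_origin_cb;
    LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;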

> This abort_prepared callback will abort the dummy PREPARED query from
> step (3) above. Instead of doing this, we could actually check if the
> 'GID' entry exists and then call ROLLBACK PREPARED on the subscriber.
> But in that case we can't be sure if the GID does not exist because of
> a rollback-during-decode-issue on the provider or due to something
> else. If we are ok with not finding GIDs on the subscriber side, then
> am fine with removing the DUMMY prepare from step (3).
>

I prefer the latter approach personally, not doing the dummy 2pc xact.
Instead we can just ignore a ROLLBACK PREPARED for a txn whose gid does not
exist on the downstream. I can easily see situations where we might
manually abort a txn and wouldn't want logical decoding to get perpetually
stuck waiting to abort a gid that doesn't exist, for example.

Ignoring commit prepared for a missing xact would not be great, but I think
it's sensible enough to ignore missing GIDs for rollback prepared.

We'd need a race-free way to do that though, so I think we'd have to
extend FinishPreparedTransaction and LockGXact with some kind of missing_ok
flag. I doubt that'd be controversial.
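
Something like this is all I'm picturing (the parameter placement is
arbitrary; the point is just that a missing gid becomes a soft failure the
caller can opt into):

/* twophase.c sketch; LockGXact returns NULL instead of erroring when
 * missing_ok is true and the gid isn't found. */
static GlobalTransaction LockGXact(const char *gid, Oid user, bool missing_ok);

void
FinishPreparedTransaction(const char *gid, bool isCommit, bool missing_ok)
{
    GlobalTransaction gxact;

    gxact = LockGXact(gid, GetUserId(), missing_ok);
    if (gxact == NULL)
    {
        Assert(missing_ok);
        return;                 /* nothing to do; gid already gone */
    }

    /* ... rest as today ... */
}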

A couple of other considerations not covered in what you wrote:

- It's really important that the hook that decides whether to decode an
xact at prepare or commit prepared time reports the same answer each and
every time, including if it's called after a prepared txn has committed. It
probably can't look at anything more than the xact's origin replica
identity, xid, and gid. This also means we need to know the gid of prepared
txns when processing their commit record, so we can tell logical decoding
whether we already sent the data to the client at prepare-transaction time,
or if we should send it at commit-prepared time instead. (A rough sketch of
such a hook follows below this list.)

- You need to flush the syscache when you finish decoding a PREPARE
TRANSACTION of an xact that made catalog changes, unless it's immediately
followed by COMMIT PREPARED of the same xid, because xacts between the two
cannot see the changes the prepared xact made to the catalogs. (Also
sketched below.)

- For the same reason we need to ensure that the historic snapshot used to
decode a 2PC xact that made catalog changes isn't then used for subsequent
xacts between the prepare and commit. They'd see the uncommitted catalogs
of the prepared xact.
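
For the first point above, a rough idea of the kind of hook signature I mean
(entirely invented names; the important part is that the arguments are
limited to things that can't change between PREPARE and COMMIT PREPARED):

/* Hypothetical output plugin callback: "should this 2PC xact be decoded at
 * PREPARE TRANSACTION time?". Must return the same answer every time it's
 * asked about a given xact, including after the xact has committed. */
typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
                                              RepOriginId origin_id,
                                              TransactionId xid,
                                              const char *gid);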
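
And for the second point, the bluntest safe option is to throw the caches
away once we've finished decoding such a PREPARE. Where exactly this hooks
into reorderbuffer.c is hand-waved here; InvalidateSystemCaches() and the
has_catalog_changes flag are existing bits:

#include "replication/reorderbuffer.h"
#include "utils/inval.h"

static void
CleanupAfterDecodedPrepare(ReorderBufferTXN *txn)
{
    /*
     * If the prepared xact touched the catalogs, xacts decoded before its
     * COMMIT PREPARED must not see cache entries built from its changes.
     */
    if (txn->has_catalog_changes)
        InvalidateSystemCaches();
}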

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
