Re: Prepared transaction releasing locks before deregistering its GID

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Oleksii Kliukin <alexk(at)hintbits(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Prepared transaction releasing locks before deregistering its GID
Date: 2019-02-20 06:14:07
Message-ID: CAD21AoAYLxWaH0v=8Xve53WUbhiQ+WWZ1GJd-5AkidxOshOiUQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Feb 19, 2019 at 12:27 PM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>
> On Mon, Feb 18, 2019 at 05:05:13PM +0100, Oleksii Kliukin wrote:
> > That looks like a race condition to me. What happens is that another
> > transaction with the name identical to the running one can start and proceed
> > to the prepare phase while the original one commits, failing at last instead
> > of waiting for the original one to finish.
>
> It took me 50 clients and a bit more than 20 seconds, but I have been
> able to reproduce the problem with one error. Thanks for the
> reproducer. This is indeed a race condition with 2PC.
>
> > By looking at the source code of FinishPreparedTransaction() I can see the
> > RemoveGXact() call, which removes the prepared transaction from
> > TwoPhaseState->prepXacts. It is placed at the very end of the function,
> > after the post-commit callbacks that clear out the locks held by the
> > transaction. Those callbacks are not guarded by the TwoPhaseStateLock,
> > resulting in a possibility for a concurrent session to proceed will
> > MarkAsPreparing after acquiring the locks released by them.
>
> Hm, I see. Taking a breakpoint just after ProcessRecords() or putting
> a sleep there makes the issue plain. The same issue can happen with
> predicate locks.
>
> > I couldn’t find any documentation on the expected outcome in cases like
> > this, so I assume it might not be a bug, but an undocumented behavior.
>
> If you run two transactions in parallel using your script, the second
> transaction would wait at LOCK time until the first transaction
> releases its locks with the COMMIT PREPARED.
>
> > Should I go about and produce a patch to put a note in the description of
> > commit prepared, or is there any interest in changing the behavior to avoid
> > such conflicts altogether (perhaps by holding the lock throughout the
> > cleanup phase)?
>
> That's a bug, let's fix it. I agree with your suggestion that we had
> better not release the 2PC lock using the callbacks for COMMIT
> PREPARED or ROLLBACK PREPARED until the shared memory state is
> cleared. At the same time, I think that we should be smarter in the
> ordering of the actions: we care about predicate locks here, but not
> about the on-disk file removal and the stat counters. One trick is
> that the post-commit callback calls TwoPhaseGetDummyProc() which would
> try to take TwoPhaseStateLock which needs to be applied so we need
> some extra logic to not take a lock in this case. From what I can see
> this is older than 9.4 as the removal of the GXACT entry in shared
> memory and the post-commit hooks are out of sync for a long time :(
>
> Attached is a patch that fixes the problem for me. Please note the
> safety net in TwoPhaseGetGXact().

Thank you for working on this.

@@ -811,6 +811,9 @@ TwoPhaseGetGXact(TransactionId xid)
static TransactionId cached_xid = InvalidTransactionId;
static GlobalTransaction cached_gxact = NULL;

+ Assert(!lock_held ||
+ LWLockHeldByMeInMode(TwoPhaseStateLock, LW_EXCLUSIVE));
+
/*
* During a recovery, COMMIT PREPARED, or ABORT PREPARED,
we'll be called
* repeatedly for the same XID. We can save work with a simple cache.
@@ -818,7 +821,8 @@ TwoPhaseGetGXact(TransactionId xid)
if (xid == cached_xid)
return cached_gxact;

- LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+ if (!lock_held)
+ LWLockAcquire(TwoPhaseStateLock, LW_SHARED);

for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
{

It seems strange to me, why do we always require an exclusive lock
here in spite of acquiring a shared lock?

-----
@@ -854,7 +859,7 @@ TwoPhaseGetGXact(TransactionId xid)
BackendId
TwoPhaseGetDummyBackendId(TransactionId xid)
{
- GlobalTransaction gxact = TwoPhaseGetGXact(xid);
+ GlobalTransaction gxact = TwoPhaseGetGXact(xid, false);

return gxact->dummyBackendId;
}

TwoPhaseGetDummyBackendId() is called by
multixact_twophase_postcommit() which is called while holding
TwoPhaseStateLock in exclusive mode in FinishPreparedTransaction().
Since we cache the global transaction when we call
lock_twophase_commit() I couldn't produce issue but it seems to have a
potential of locking problem.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tsunakawa, Takayuki 2019-02-20 06:20:37 RE: Speed up transaction completion faster after many relations are accessed in a transaction
Previous Message Kato, Sho 2019-02-20 06:11:22 RE: Speeding up creating UPDATE/DELETE generic plan for partitioned table into a lot