Re: Re: BUG #14680: startup process on standby encounter a deadlock of TwoPhaseStateLock when redo 2PC xlog

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: wangchuanting <wangchuanting(at)huawei(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: Re: BUG #14680: startup process on standby encounter a deadlock of TwoPhaseStateLock when redo 2PC xlog
Date: 2017-06-13 23:08:12
Message-ID: 20170613230812.uvlhjlvfbad7njb7@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

Michael Paquier wrote:

> The current coding is actually safe because the checkpointer does not
> remove or add any 2PC entry in the array while holding
> TwoPhaseStateLock, it just updates some values that need to be read
> and/or written while holding the lock. Well, to be honest, HEAD is
> wrong because it can read a flag value while the checkpointer updates
> it, and the patch is careful to change that to be correct. The wrong
> part is when calling ProcessTwoPhaseBuffer() in
> RecoverPreparedTransactions() which accesses gxact->ondisk and
> prepare_start_lsn without locking things.

Honestly I don't like this rationale very much. Even if doing the
unlocked access is safe today, it looks like installing a landmine for
the future, and for what? I don't think there's a lot to be gained:
RecoverPreparedTransactions only runs once in the life of a server, and
CheckPointTwoPhase is supposed to have a very short runtime (per
explanation in comments therein). It seems better to me to continue our
tradition of using the appropriate locks instead of playing a dangerous
game.

So I propose that RecoverPreparedTransactions grabs exclusive lock at
the top, and only the bottom part of the loop is done unlocked, which
AFAICS should be safe. (MarkAsPrepared gained a boolean argument
indicating that caller already holds lock).

Here's a patch along those lines. The full testsuite is running now,
but the recovery tests pass fine.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
2pc-redo-lwlock-fix-v6.patch text/plain 13.0 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Michael Paquier 2017-06-14 01:03:11 Re: Re: BUG #14680: startup process on standby encounter a deadlock of TwoPhaseStateLock when redo 2PC xlog
Previous Message Michael Paquier 2017-06-13 20:38:52 Re: [BUGS] Invalid WAL segment size. Allowed values are 1,2,4,8,16,32,64

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2017-06-14 01:03:11 Re: Re: BUG #14680: startup process on standby encounter a deadlock of TwoPhaseStateLock when redo 2PC xlog
Previous Message Stephen Frost 2017-06-13 22:29:20 Re: WIP: Data at rest encryption