Re: Re: BUG #14680: startup process on standby encounter a deadlock of TwoPhaseStateLock when redo 2PC xlog

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: wangchuanting <wangchuanting(at)huawei(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: Re: BUG #14680: startup process on standby encounter a deadlock of TwoPhaseStateLock when redo 2PC xlog
Date: 2017-06-12 03:17:08
Message-ID: CAB7nPqRRuR6ig5c5JWNKEdifHY9-SrJ4eWwNy7DC9sAkGzwJzA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

On Sun, Jun 11, 2017 at 12:24 PM, Alvaro Herrera
<alvherre(at)2ndquadrant(dot)com> wrote:
> Alvaro Herrera wrote:
>
> For some reason I sent a preliminary version of the email I was writing.
> Some additional thoughts that were supposed to be in there:

No problem. Thanks for the input.

>> I also agree that restoreTwoPhaseData doesn't need to hold the lock,
>> since we don't expect anything running concurrently, but that it's okay
>> to hold it and makes the whole thing simpler.
>
> It's okay to hold the lock for the whole duration of the function.

Yup.

>> We could try to apply an equivalent rationale to
>> RecoverPreparedTransactions: even though we have already been running in
>> HOT standby mode for a while, there's no possibility of concurrent
>> activity either: since we exited recovery, master cannot write any more
>> 2PC xacts, and we haven't yet set the flag that lets open sessions write
>> WAL. However, it seems mildly dangerous to run the bottom sections of
>> the loop with the lock held, since it acquires other lwlocks and
>> generally does a lot of crap.
>
> Therefore, let's adopt your idea of acquiring the lock only for the
> specific low-level calls that require it. But the comment atop the loop
> needs a lot of work to explain why it's doing that (i.e. how come it's
> reading TwoPhaseState without a lock), as in my proposed patch.
>
> (At first, I didn't really like this very much and wanted to add more
> locking to that function, but it turned quite messy, so I back-tracked).

Yes, I got the same temptation but give up as well because of the
unnecessary complications. GXactLoadSubxactData() does nothing fancy,
but my worries are in ProcessRecords() and PostPrepare_Twophase(),
particularly if they do more complicated and fancy stuff in the future
for whatever reason. The comment you have added is this one:
/*
- * Don't need a lock in the recovery phase.
+ * It's okay to access TwoPhaseState without a lock here: recovery is
+ * finished (so if we were a standby, there's no master that can prepare
+ * transactions anymore), and we haven't yet set WAL as open for writes,
+ * so local existing backends, if any, cannot do so either.
We could use a
+ * coding pattern similar to restoreTwoPhaseData, i.e., run
the whole loop
+ * with the lock held; but this loop is far more complex, so
instead only
+ * grab the lock while calling the low-level functions that require it.
*/
I have reworked the comment as follows:
/*
- * Don't need a lock in the recovery phase.
+ * It is fine to access TwoPhaseState without a lock here: recovery is
+ * finished (so if we were a standby, there's no master that can prepare
+ * transactions anymore), and we haven't yet set WAL as open for writes,
+ * so local existing backends, if any, cannot do so either. We could use a
+ * coding pattern similar to restoreTwoPhaseData, i.e., run the whole loop
+ * with the lock held; but this loop is far more complex, so instead only
+ * grab the lock while calling the low-level functions working directly on
+ * manipulating the two-phase state data. Functions working directly on
+ * PGPROC entries linked with the two-phase transaction work with other
+ * types of locks but we don't want to complicate that more than necessary.
*/
The v5 attached does not have any other changes than what you did in
v4 (plus typos noticed).

- if (info == XLOG_XACT_COMMIT || info == XLOG_XACT_COMMIT_PREPARED)
+ if (info == XLOG_XACT_COMMIT)
{
xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(record);
xl_xact_parsed_commit parsed;
No objections to the shuffling done here.

+ * An exclusive lock on TwoPhaseStateLock is taken for the duration of the
+ * scan of TwoPhaseState data as its may be updated.
Typo noticed here => s/its/it/.
--
Michael

Attachment Content-Type Size
2pc-redo-lwlock-fix-v5.patch application/octet-stream 12.9 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Tom Lane 2017-06-12 04:11:26 Re: BUG #14701: pg_dump fails to dump pg_catalog schema
Previous Message Tom Lane 2017-06-12 00:19:52 Re: BUG #14701: pg_dump fails to dump pg_catalog schema

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2017-06-12 04:04:23 Re: RTE_NAMEDTUPLESTORE, enrtuples and comments
Previous Message Jeevan Ladhe 2017-06-12 02:59:30 Re: Adding support for Default partition in partitioning