Re: Assertion failure in WaitForWALToBecomeAvailable state machine

From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Assertion failure in WaitForWALToBecomeAvailable state machine
Date: 2022-09-13 03:22:56
Message-ID: 20220913032256.GB1438180@rfd.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Sep 12, 2022 at 05:58:08PM +0530, Bharath Rupireddy wrote:
> On Mon, Sep 12, 2022 at 12:30 PM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> > On Sat, Sep 10, 2022 at 07:52:01AM +0530, Bharath Rupireddy wrote:
> > > Just for the records - there's another report of the assertion failure
> > > at [1], many thanks to Kyotaro-san for providing consistent
> > > reproducible steps.
> > >
> > > [1] - https://www.postgresql.org/message-id/flat/20220909.172949.2223165886970819060.horikyota.ntt%40gmail.com

I plan to use that message's patch, because it guarantees WALRCV_STOPPED at
the code location being changed. Today, in the unlikely event of
!WalRcvStreaming() due to WALRCV_WAITING or WALRCV_STOPPING, that code
proceeds without waiting for WALRCV_STOPPED. I think WALRCV_STOPPING can't
happen there. Even if WALRCV_WAITING also can't happen, it's a simplicity win
to remove the need to think about that. Dilip Kumar and Nathan Bossart also
stated support for that design.

If WALRCV_WAITING or WALRCV_STOPPING can happen at that patch's code site, I
perhaps should back-patch the change to released versions. Does anyone know
whether one or both can happen?

> > FWIW, I
> > find better the approach taken by Horiguchi-san in [1] to reset the
> > state before we attempt to read WAL from the archives *or* pg_wal,
> > after we know that the last source has failed, where we know that we
> > are not streaming yet (but recovery may be requested soon).
> >
> > [1]: https://www.postgresql.org/message-id/20220214.171428.735280610520122187.horikyota.ntt@gmail.com
>
> I think the fix at [1] is wrong. It basically resets
> InstallXLogFileSegmentActive for both XLOG_FROM_ARCHIVE and
> XLOG_FROM_PG_WAL, remember that we don't need WAL files preallocation
> and recycling just for archive. Moreover, if we were to reset just for
> archive there, it's too aggressive, meaning for every WAL record, we
> take ControlFileLock in exclusive mode to reset it.
>
> IMO, doing it once when we switch the source to archive is the correct
> way because it avoids frequent ControlFileLock acquisitions and makes
> the fix more intuitive. And we have a comment saying why we reset
> InstallXLogFileSegmentActive, if required we can point to the commit
> cc2c7d6 in the comments.

The startup process maintains the invariant that archive recovery and
streaming recovery don't overlap in time; one stops before the other starts.
As long as it achieves that, I'm not aware of a cause to care whether the flag
reset happens when streaming ends vs. when archive recovery begins. If the
startup process develops a defect such that it ceases to maintain that
invariant, "when archive recovery begins" does have the
advantage-or-disadvantage of causing a failure in non-assert builds. The
other way gets only an assertion failure. Michael Paquier or Bharath
Rupireddy, did you envision benefits other than that one?

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Zhang Mingli 2022-09-13 03:35:37 Removed unused param isSlice of function transformAssignmentSubscripts
Previous Message Amit Langote 2022-09-13 03:17:29 Re: cataloguing NOT NULL constraints