Re: Assertion failure in WaitForWALToBecomeAvailable state machine

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Assertion failure in WaitForWALToBecomeAvailable state machine
Date: 2022-09-13 04:10:36
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Sep 12, 2022 at 08:22:56PM -0700, Noah Misch wrote:
> I plan to use that message's patch, because it guarantees WALRCV_STOPPED at
> the code location being changed. Today, in the unlikely event of
> !WalRcvStreaming() due to WALRCV_WAITING or WALRCV_STOPPING, that code
> proceeds without waiting for WALRCV_STOPPED. I think WALRCV_STOPPING can't
> happen there. Even if WALRCV_WAITING also can't happen, it's a simplicity win
> to remove the need to think about that. Dilip Kumar and Nathan Bossart also
> stated support for that design.

I did not notice this one. Sounds pretty much right.

> If WALRCV_WAITING or WALRCV_STOPPING can happen at that patch's code site, I
> perhaps should back-patch the change to released versions. Does anyone know
> whether one or both can happen?

Hmm. I agree that this could be a good thing in the long-term.

> The startup process maintains the invariant that archive recovery and
> streaming recovery don't overlap in time; one stops before the other starts.
> As long as it achieves that, I'm not aware of a cause to care whether the flag
> reset happens when streaming ends vs. when archive recovery begins. If the
> startup process develops a defect such that it ceases to maintain that
> invariant, "when archive recovery begins" does have the
> advantage-or-disadvantage of causing a failure in non-assert builds. The
> other way gets only an assertion failure. Michael Paquier or Bharath
> Rupireddy, did you envision benefits other than that one?

I think that you are forgetting one case here though: a standby where
standby.signal is set without restore_command and with
primary_conninfo. It could move on with recovery even if the WAL
receiver is unstable when something external pushes more WAL segments
to the standby's pg_wal/. So my point of upthread is that we should
make sure that standby.signal+primary_conninfo resets the flag state
when the WAL receiver stops streaming in this case as well.

The proposed v1-0001-Do-not-skip-calling-XLogShutdownWalRcv.patch
would achieve that, because it does not change the state of
InstallXLogFileSegmentActive depending on the source being
XLOG_FROM_ARCHIVE. So I'm fine with what you want to do.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2022-09-13 04:32:11 Re: Background writer and checkpointer in crash recovery
Previous Message Zhang Mingli 2022-09-13 03:35:37 Removed unused param isSlice of function transformAssignmentSubscripts