Re: Assertion failure in WaitForWALToBecomeAvailable state machine

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Assertion failure in WaitForWALToBecomeAvailable state machine
Date: 2022-02-11 12:51:53
Message-ID: CALj2ACUoBWbaFo_t0gew+x6n0V+mpvB_23HLvsVD9abgCShV5A@mail.gmail.com
Lists: pgsql-hackers

On Fri, Feb 11, 2022 at 3:33 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> Hi,
>
> The problem is that whenever we are going for streaming we always set
> XLogCtl->InstallXLogFileSegmentActive to true, but while switching
> from streaming to archive we do not always reset it so it hits
> assertion in some cases. Basically we reset it inside
> XLogShutdownWalRcv() but while switching from the streaming mode we
> only call it conditionally if WalRcvStreaming(). But it is very much
> possible that even before we call WalRcvStreaming() the walreceiver
> might have set walrcv->walRcvState to WALRCV_STOPPED. So now
> WalRcvStreaming() will return false. So I agree now we do not want to
> really shut down the walreceiver but who will reset the flag?
>
> I just ran some tests on primary and attached the walreceiver to gdb
> and waited for it to exit with timeout and then the recovery process
> hit the assertion.
>
> 2022-02-11 14:33:56.976 IST [60978] FATAL: terminating walreceiver
> due to timeout
> cp: cannot stat
> ‘/home/dilipkumar/work/PG/install/bin/wal_archive/00000002.history’:
> No such file or directory
> 2022-02-11 14:33:57.002 IST [60973] LOG: restored log file
> "000000010000000000000003" from archive
> TRAP: FailedAssertion("!XLogCtl->InstallXLogFileSegmentActive", File:
> "xlog.c", Line: 3823, PID: 60973)
>
> I have just applied a quick fix and that solved the issue, basically
> if the last failed source was streaming and the WalRcvStreaming() is
> false then just reset this flag.

IIUC, the issue can happen when the walreceiver has failed to get WAL
from the primary for whatever reason, its status is not
WALRCV_STOPPING or WALRCV_STOPPED, and the startup process has moved
ahead in WaitForWALToBecomeAvailable to read from the archive, which
ends up in this assertion failure. ISTM this is a rare scenario, and it
depends on what the walreceiver does between failing to get WAL from
the primary and updating its status to WALRCV_STOPPING or
WALRCV_STOPPED.

If the above race condition is a serious problem, or is considered a
problem at all, it needs to be fixed. I don't think just setting
InstallXLogFileSegmentActive to false is enough. Going by the comment
[1], it doesn't make sense to move ahead with restoring from the
archive location without the WAL receiver being fully stopped. IMO, the
real fix is to remove the WalRcvStreaming() check and call
XLogShutdownWalRcv() unconditionally. Anyway, we have the
Assert(!WalRcvStreaming()); further down. I don't think it will create
any problems.

[1]
/*
 * Before we leave XLOG_FROM_STREAM state, make sure that
 * walreceiver is not active, so that it won't overwrite
 * WAL that we restore from archive.
 */
if (WalRcvStreaming())
    XLogShutdownWalRcv();
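
To make the difference between the two approaches concrete, here is a
small standalone model in C. It is only a sketch, not code from xlog.c:
the ModelState struct and every model_* / leave_stream_* name are
hypothetical stand-ins, and it assumes (as described in the quoted
mail) that XLogShutdownWalRcv() is the place where
InstallXLogFileSegmentActive gets reset.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the real state; this is not PostgreSQL code. */
typedef enum { MODEL_WALRCV_STOPPED, MODEL_WALRCV_STREAMING } ModelWalRcvState;

typedef struct
{
    ModelWalRcvState walrcv_state;    /* models walrcv->walRcvState */
    bool install_segment_active;      /* models XLogCtl->InstallXLogFileSegmentActive */
} ModelState;

/* Models WalRcvStreaming(): true only while the walreceiver is running. */
static bool model_walrcv_streaming(const ModelState *s)
{
    return s->walrcv_state == MODEL_WALRCV_STREAMING;
}

/* Models XLogShutdownWalRcv(): stop the walreceiver and reset the flag. */
static void model_shutdown_walrcv(ModelState *s)
{
    s->walrcv_state = MODEL_WALRCV_STOPPED;
    s->install_segment_active = false;
}

/* Existing behaviour per [1]: shut down only if still streaming. */
static void leave_stream_conditional(ModelState *s)
{
    if (model_walrcv_streaming(s))
        model_shutdown_walrcv(s);
}

/* Proposed behaviour: shut down unconditionally, then check as the Assert does. */
static void leave_stream_unconditional(ModelState *s)
{
    model_shutdown_walrcv(s);
    assert(!model_walrcv_streaming(s));
}

/* Streaming was started (flag set), then the walreceiver exited by itself. */
static ModelState model_after_walrcv_timeout(void)
{
    ModelState s = { MODEL_WALRCV_STREAMING, true };

    s.walrcv_state = MODEL_WALRCV_STOPPED;    /* e.g. terminated due to timeout */
    return s;
}
```

In this model, calling leave_stream_conditional() on the state returned
by model_after_walrcv_timeout() leaves install_segment_active set,
which corresponds to the condition the failed assertion complains
about, while leave_stream_unconditional() clears it.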

Regards,
Bharath Rupireddy.
