Re: Assertion failure in WaitForWALToBecomeAvailable state machine

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Assertion failure in WaitForWALToBecomeAvailable state machine
Date: 2022-02-11 16:55:49
Message-ID: CALj2ACWEoYibaTELh2dXPLd25o0K7SkJ-dVaG64t0tQACExnSg@mail.gmail.com
Lists: pgsql-hackers

On Fri, Feb 11, 2022 at 6:31 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Fri, Feb 11, 2022 at 6:22 PM Bharath Rupireddy
> <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
> >
> > On Fri, Feb 11, 2022 at 3:33 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > >
>
> > IIUC, the issue can happen while the walreceiver failed to get WAL
> > from primary for whatever reasons and its status is not
> > WALRCV_STOPPING or WALRCV_STOPPED, and the startup process moved ahead
> > in WaitForWALToBecomeAvailable for reading from archive which ends up
> > in this assertion failure. ITSM, a rare scenario and it depends on
> > what walreceiver does between failure to get WAL from primary and
> > updating status to WALRCV_STOPPING or WALRCV_STOPPED.
> >
> > If the above race condition is a serious problem, if one thinks at
> > least it is a problem at all, that needs to be fixed.
>
> I don't think we can design a software which has open race conditions
> even though they are rarely occurring :)

Yes.

> > I don't think
> > just making InstallXLogFileSegmentActive false is enough. By looking
> > at the comment [1], it doesn't make sense to move ahead for restoring
> > from the archive location without the WAL receiver fully stopped.
> > IMO, the real fix is to just remove WalRcvStreaming() and call
> > XLogShutdownWalRcv() unconditionally. Anyways, we have the
> > Assert(!WalRcvStreaming()); down below. I don't think it will create
> > any problem.
>
> If WalRcvStreaming() is returning false that means walreceiver is
> already stopped so we don't need to shutdown it externally. I think
> like we are setting this flag outside start streaming we can reset it
> also outside XLogShutdownWalRcv. Or I am fine even if we call
> XLogShutdownWalRcv() because if walreceiver is stopped it will just
> reset the flag we want it to reset and it will do nothing else.

As I said, I'm okay with just calling XLogShutdownWalRcv()
unconditionally: if the walreceiver has already stopped, the call
just sets the InstallXLogFileSegmentActive flag to false and returns.
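
To make the difference concrete, here is a rough standalone model of
the two variants. It is only a sketch, not the actual xlog.c code: the
Model* types and the model_*/switch_to_archive_* functions are made-up
names for illustration, the state machine is reduced to three states,
and locking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the walreceiver state. */
typedef enum
{
	MODEL_WALRCV_STOPPED,
	MODEL_WALRCV_STREAMING,
	MODEL_WALRCV_STOPPING
} ModelWalRcvState;

/* Hypothetical container for the two pieces of state discussed here. */
typedef struct
{
	ModelWalRcvState walrcv_state;
	bool		install_segment_active;	/* models InstallXLogFileSegmentActive */
} ModelRecoveryState;

/* Models WalRcvStreaming(): true only while the receiver is streaming. */
bool
model_walrcv_streaming(const ModelRecoveryState *s)
{
	return s->walrcv_state == MODEL_WALRCV_STREAMING;
}

/*
 * Models XLogShutdownWalRcv(): stop the receiver if it is not already
 * stopped, then always reset the flag.
 */
void
model_shutdown_walrcv(ModelRecoveryState *s)
{
	if (s->walrcv_state != MODEL_WALRCV_STOPPED)
		s->walrcv_state = MODEL_WALRCV_STOPPED;
	s->install_segment_active = false;
}

/* Conditional variant: shut down only if the receiver looks active. */
void
switch_to_archive_conditional(ModelRecoveryState *s)
{
	if (model_walrcv_streaming(s))
		model_shutdown_walrcv(s);
}

/* Proposed variant: call the shutdown routine unconditionally. */
void
switch_to_archive_unconditional(ModelRecoveryState *s)
{
	model_shutdown_walrcv(s);
	assert(!model_walrcv_streaming(s));
}
```

In this model, if the receiver has already stopped while the flag is
still true, the conditional variant leaves the flag set, whereas the
unconditional variant always clears it before moving on to the archive.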

Let's also hear what other hackers have to say about this.

Regards,
Bharath Rupireddy.
