Re: Incorrect logic in XLogNeedsFlush()

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Incorrect logic in XLogNeedsFlush()
Date: 2025-09-17 13:35:32
Message-ID: CAFiTN-sw1aBxUjKk74m0LG_GGHNat8ZJSyYRMpco48G8hQwMkQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Sep 15, 2025 at 8:28 PM Melanie Plageman
<melanieplageman(at)gmail(dot)com> wrote:
>
> On Sun, Sep 14, 2025 at 4:10 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > OTOH, when creating a RestartPoint during standby replay or doing
> > archive recovery, access to both LocalMinRecoveryPoint and
> > ControlFile->minRecoveryPoint is correct. By this stage, it has
> > completed the minimum recovery required to achieve a consistent state,
> > and ControlFile->minRecoveryPoint would have been properly
> > initialized.
>
> My confusion was two things: how the startup process is getting a
> valid minimum recovery point during crash recovery on a standby and
> how bgwriter and checkpointer avoid reading the control file during
> crash recovery on the standby.
>
> IIUC, the minimum recovery point on the primary will always be
> InvalidXLogRecPtr.

Right

> On the standby, during "normal" recovery, all backends (regular
> backends running read-only queries, checkpointer, etc) update the
> minimum recovery point so it is ready in case of a crash.

Right

> During recovery from a crash on the standby, no regular backends will
> be running read-only queries because those are forbidden until crash
> recovery finishes. Crash recovery on the standby is finished when WAL
> has been replayed through the minimum recovery point -- or, at least I
> thought that was the point of the minimum recovery point.
>
> But, the startup process needs to replay WAL through the minimum
> recovery point. So, where does it get this value from? The control
> file?

In a recovery scenario, the ControlFile->minRecoveryPoint on a standby
server is continuously updated. This ensures that even in the event of
a crash, a valid recovery point is available. However, if a crashed
standalone server is started as a standby, the
ControlFile->minRecoveryPoint is initially unset, and this will be set
in ReadRecord()->SwitchIntoArchiveRecovery() once all the WAL from
pg_wal directory are applied, this happens right before start
streaming from primary.

> And if the startup process gets the minimum recovery point from the
> ControlFile, then how are bgwriter and checkpointer protected from
> reading it since they should only see InvalidXLogRecPtr during crash
> recovery?
>
> Looking at this logic in XLogFlush() -> UpdateMinRecoveryPoint()
>
> if (XLogRecPtrIsInvalid(LocalMinRecoveryPoint) && InRecovery)
> {
> updateMinRecoveryPoint = false;
> return;
> }
>
> Only the startup process will pass this test. Then other processes
> will continue to read the control file. This only works if
> ControlFile->minRecoveryPoint is InvalidXLogRecPtr during crash
> recovery.

Until the startup process hasn't completed the crash recovery i.g. not
applied the WAL unto 'ControlFile->minRecoveryPoint' logically
'bgwriter' can not write any buffer whose WAL location is beyond the
ControlFile->minRecoveryPoint because we haven't yet applied those
WALs and also there should not any RestartPoint()/Checkpoint record
before reaching the ControlFile->minRecoveryPoint otherwise we would
have started crash recovery from that point.

During the crash recovery process, before the WAL has been fully
applied up to the ControlFile->minRecoveryPoint, the bgwriter cannot
write any buffers that have a page LSN beyond this min recovery point.
This is because those WALs which are beyond min recovery point have
not yet been applied, so any buffer can not have page_lsn beyond this
point.

Therefore, I agree that when UpdateMinRecoveryPoint() is invoked by
the bgwriter during standby crash recovery, it will indeed retrieve a
valid value from ControlFile->minRecoveryPoint. However, IMHO, this
cannot be called with LSN greater than ControlFile->minRecoveryPoint
itself. Consequently, the conditional check else if (force ||
LocalMinRecoveryPoint < lsn) will not evaluate to true, and no action
will be taken.

And as far as the checkpointer is concerned, on standby systems, we
only execute RestartPoint when a Checkpoint WAL is detected. And
logically, no Checkpoint WAL should exist between the recovery start
point and minRecoveryPoint, as crash recovery initiates from the most
recent checkpoint redo.

--
Regards,
Dilip Kumar
Google

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Mircea Cadariu 2025-09-17 13:35:53 Re: [BUG] temporary file usage report with extended protocol and unnamed portals
Previous Message Fujii Masao 2025-09-17 13:31:03 Re: pgbench: remove an unused function argument