Quick Links

Re: Incorrect logic in XLogNeedsFlush()

From:	Melanie Plageman <melanieplageman(at)gmail(dot)com>
To:	Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject:	Re: Incorrect logic in XLogNeedsFlush()
Date:	2025-09-15 14:58:28
Message-ID:	CAAKRu_ZEcWNbWAf_bBpSugHh=GTS=EYmb_8npSztU=vLcn=H+g@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Sun, Sep 14, 2025 at 4:10 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Fri, Sep 12, 2025 at 10:02 PM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>
> > Assuming I'm right so far, then I agree with changing the test in
> > XLogNeedsFlush() to !XLogInsertAllowed(), as the patch does.
> >
> > The comment in the patch is describing what's happening, but lost the
> > "why". Perhaps something like:
> >
> > "During recovery, we don't flush WAL but update minRecoveryPoint
> > instead. So "needs flush" is taken to mean whether minRecoveryPoint
> > would need to be updated. The end-of-recovery checkpoint still must
> > flush WAL like normal, though, so check !XLogInsertAllowed(). This
> > check must be consistent with XLogFlush()."
>
> Updated as per suggestion

I like the idea of adding an assert to keep the two in sync. However, two things

1) Since XLogNeedsFlush() refreshes LogwrtResult, is it possible for
this assert to fail because the flush pointer was updated? I realize
that that is the case with my original patch as well. If this is true,
perhaps there is not a way to use XLogNeedsFlush() after XLogFlush()
-- only before as a check to see if we must call XLogFlush().

2) You've added the assert at the bottom which will only cover the
flush pointer case and not the min recovery point case. Maybe that's
good enough though...

> > One loose end: there was some discussion about the times when
> > LocalMinRecoveryPoint is valid/correct. I'm not sure I entirely
> > understood -- is that no longer a concern?
>
> IMHO during crash recovery LocalMinRecoveryPoint and
> ControlFile->minRecoveryPoint can not initialized until we replay all
> WAL so those should never get accessed. However if XLogNeedsFlush()
> were called during the end of the recovery checkpoint it was accessing
> this which was wrong although it was not a live bug and this patch is
> making that more resilient. OTOH if we are creating a restartpoint
> during standby replay or archive recovery then we do access
> LocalMinRecoveryPoint and ControlFile->minRecoveryPoint because by
> that time we have already done the minimum recovery require for
> reaching to the consistent state and we would have initialized
> ControlFile->minRecoveryPoint.
>
> IMHO during crash recovery, the variables LocalMinRecoveryPoint and
> ControlFile->minRecoveryPoint remain uninitialized until all required
> WAL has been replayed. Consequently, they must not be accessed during
> this phase. Previously, calling XLogNeedsFlush() during the "end of
> recovery checkpoint" incorrectly accessed these uninitialized values.
> While this was not a live bug, this patch hardens the code against
> this type of race condition.
>
> OTOH, when creating a RestartPoint during standby replay or doing
> archive recovery, access to both LocalMinRecoveryPoint and
> ControlFile->minRecoveryPoint is correct. By this stage, it has
> completed the minimum recovery required to achieve a consistent state,
> and ControlFile->minRecoveryPoint would have been properly
> initialized.

My confusion was two things: how the startup process is getting a
valid minimum recovery point during crash recovery on a standby and
how bgwriter and checkpointer avoid reading the control file during
crash recovery on the standby.

IIUC, the minimum recovery point on the primary will always be
InvalidXLogRecPtr.

On the standby, during "normal" recovery, all backends (regular
backends running read-only queries, checkpointer, etc) update the
minimum recovery point so it is ready in case of a crash.

During recovery from a crash on the standby, no regular backends will
be running read-only queries because those are forbidden until crash
recovery finishes. Crash recovery on the standby is finished when WAL
has been replayed through the minimum recovery point -- or, at least I
thought that was the point of the minimum recovery point.

But, the startup process needs to replay WAL through the minimum
recovery point. So, where does it get this value from? The control
file?

And if the startup process gets the minimum recovery point from the
ControlFile, then how are bgwriter and checkpointer protected from
reading it since they should only see InvalidXLogRecPtr during crash
recovery?

Looking at this logic in XLogFlush() -> UpdateMinRecoveryPoint()

if (XLogRecPtrIsInvalid(LocalMinRecoveryPoint) && InRecovery)
{
updateMinRecoveryPoint = false;
return;
}

Only the startup process will pass this test. Then other processes
will continue to read the control file. This only works if
ControlFile->minRecoveryPoint is InvalidXLogRecPtr during crash
recovery.

As for your patch, I think we should also update the comment at the
top of XLogNeedsFlush(). Something like this when combined with your
comment update and code.

/*
- * Test whether XLOG data has been flushed up to (at least) the given position.
+ * Test whether XLOG data has been flushed up to (at least) the given position
+ * or whether the minimum recovery point is updated past the given position.
*
- * Returns true if a flush is still needed. (It may be that someone else
- * is already in process of flushing that far, however.)
+ * Returns true if a flush is still needed or if the minimum recovery point
+ * must be updated.
+ *
+ * It is possible that someone else is already in the process of flushing that
+ * far or updating the minimum recovery point that far.
*/
bool
XLogNeedsFlush(XLogRecPtr record)
{
/*
- * During recovery, we don't flush WAL but update minRecoveryPoint
- * instead. So "needs flush" is taken to mean whether minRecoveryPoint
- * would need to be updated.
+ * During recovery, when WAL inserts are forbidden, "needs flush" is
+ * really "minimum recovery point needs updating". The end-of-recovery
+ * checkpoint still must flush WAL like normal, though, so check
+ * !XLogInsertAllowed(). This check must be consistent with
XLogFlush().
*/
- if (RecoveryInProgress())
+ if (!XLogInsertAllowed())
{
+ Assert(RecoveryInProgress());

And I don't know if WAL inserts could ever be forbidden outside of
recovery (maybe if a read-only database mode is added) -- but maybe it
is worth adding an assertion that recovery is in progress?

- Melanie

In response to

Re: Incorrect logic in XLogNeedsFlush() at 2025-09-14 08:09:51 from Dilip Kumar

Responses

Re: Incorrect logic in XLogNeedsFlush() at 2025-09-15 19:02:36 from Melanie Plageman

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Peter Geoghegan	2025-09-15 15:12:16	Re: index prefetching
Previous Message	Robert Haas	2025-09-15 14:46:45	Re: BUG #18959: Name collisions of expression indexes during parallel Index creations on a pratitioned table.