Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting

From: Xuneng Zhou <xunengzhou(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting
Date: 2025-11-19 13:39:18
Message-ID: CABPTF7VJ7JFNkySqfYH4+RDxekEvcHRwx3JouFbvq_GW8VXtKw@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Wed, Nov 19, 2025 at 11:44 AM Xuneng Zhou <xunengzhou(at)gmail(dot)com> wrote:
>
> Hi Michael,
>
> On Sat, Nov 8, 2025 at 7:03 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> >
> > On Fri, Nov 07, 2025 at 09:48:23PM +0800, Xuneng Zhou wrote:
> > > Now that the LSN-waiting infrastructure (3b4e53a) and WAL replay
> > > wake-up calls (447aae1) are in place, this patch has been updated to
> > > make use of them.
> > > Please check.
> >
> > That's indeed much simpler. I'll check later what you have here.
> > --
> > Michael
>
> Thanks again for your earlier suggestion on splitting the patches to
> make the review process smoother.
>
> Although this version is simpler in terms of the amount of code, the
> review effort still feels non-trivial. During my own self-review, a
> few points stood out as areas that merit careful consideration:
>
> 1) Reliance on the new wait-for-LSN infrastructure
>
> The stability and correctness of this patch now depend heavily on the
> newly added wait-for-LSN infrastructure, which has not yet been
> battle-tested. This puts the patch in a bit of a dilemma: we want the
> infrastructure to be as reliable as possible, but it could be hard to
> fully validate its robustness without using it in real scenarios, even
> after careful review.

Here are some of my incomplete interpretations of the behaviors:

> 2) Wake-up behavior
>
> Are the waiting processes waking up at the correct points and under
> the right conditions? Ensuring proper wake-ups is essential for both
> correctness and performance.

Primary (Flush Wait):

The patch adds WaitLSNWakeup(WAIT_LSN_TYPE_FLUSH, LogwrtResult.Flush)
in XLogFlush and XLogBackgroundFlush right after
/* wake up walsenders now that we've released heavily contended locks */
WalSndWakeupProcessRequests(true, !RecoveryInProgress());
where walsenders get notified.

Standby (Replay Wait):
-- The "End of Recovery" Wake-up
Location: xlog.c (inside StartupXLOG, around line 6266)

/*
* Wake up all waiters for replay LSN. They need to report an error that
* recovery was ended before reaching the target LSN.
*/
WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);

This call happens immediately after the server transitions from
"Recovery" to "Production" mode (RECOVERY_STATE_DONE).

-- The "Continuous Replay" Wake-up
Location: xlogrecovery.c (inside the main redo loop, around line 1850)
/*
* If we replayed an LSN that someone was waiting for then walk
* over the shared memory array and set latches to notify the
* waiters.
*/
if (waitLSNState &&
(XLogRecoveryCtl->lastReplayedEndRecPtr >=
pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_REPLAY])))
WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);

It handles the continuous stream of updates during normal standby operation.

> 3) Edge cases
>
> Are edge cases—such as a promotion occurring while a process is
> waiting in standby—handled correctly and without introducing races or
> inconsistent states?

Consider the following sequence which I traced through the logic:

1. Pre-Promotion: A backend (e.g., a logical decoding session) calls
read_local_xlog_page_guts for a future LSN. RecoveryInProgress()
returns true, so it enters WaitForLSN(WAIT_LSN_TYPE_REPLAY, ...).

2. The Event: pg_promote() is issued. The Startup process finishes
recovery and broadcasts a wake-up to all waiters.

3. Detection: WaitForLSN returns WAIT_LSN_RESULT_NOT_IN_RECOVERY. The
code explicitly handles this case:

case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
/* Promoted while waiting... loop back */
break;

4. The Transition: The loop restarts.
-- RecoveryInProgress() is checked again and now returns false.
-- The logic automatically switches branches to
WaitForLSN(WAIT_LSN_TYPE_FLUSH, ...).

5. This transition relaxes the constraint from "wait for replay"
(required for consistency on standby) to "wait for flush" (required
for durability on primary).

6. Timeline Divergence:
XLogReadDetermineTimeline is called at the top of the loop.

-- Scenario A: Waiting for Historical Data (Pre-Promotion)
If we were waiting for LSN 0/5000 and promotion happened at 0/6000
(creating TLI 2), XLogReadDetermineTimeline will see that 0/5000
belongs to TLI 1 (now historical).
Result: state->currTLI (1) != currTLI (2).
Action: The loop breaks immediately (via the else block), skipping
the wait. Since the data is historical, it is immutable and assumed to
be on disk.

-- Scenario B: Waiting for Future Data (Post-Promotion)
If we were waiting for LSN 0/7000 and promotion happened at 0/6000
(creating TLI 2), XLogReadDetermineTimeline will identify that 0/7000
belongs to the new TLI 2.
Result: state->currTLI (2) == currTLI (2).
Action: The loop continues, and we enter
WaitForLSN(WAIT_LSN_TYPE_FLUSH, ...) to wait for the new primary to
generate this data.

-- Scenario C: Waiting exactly at the Switch Point
If we were waiting for the exact LSN where the timeline switched.
Action: XLogReadDetermineTimeline handles the boundary calculation
(tliSwitchPoint), ensuring we read from the correct segment file
(e.g., switching from 00000001... to 00000002...).

--
Best,
Xuneng

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Fujii Masao 2025-11-19 13:47:41 Re: [PATCH] Add hints for invalid binary encoding names in encode/decode functions
Previous Message Nazir Bilal Yavuz 2025-11-19 13:17:07 Re: BUG #19095: Test if function exit() is used fail when linked static