Re: Implement waiting for wal lsn replay: reloaded

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Xuneng Zhou <xunengzhou(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Álvaro Herrera <alvherre(at)kurilemu(dot)de>, Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, jian he <jian(dot)universality(at)gmail(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Subject: Re: Implement waiting for wal lsn replay: reloaded
Date: 2026-04-07 23:23:29
Message-ID: CAPpHfdvM9Tp1_g1JY9H+cx1H0Q4jH-BZFZGQwgviLNs1HSg+CA@mail.gmail.com
Lists: pgsql-hackers

Hi, Xuneng!

On Tue, Apr 7, 2026 at 4:00 PM Xuneng Zhou <xunengzhou(at)gmail(dot)com> wrote:
>
> On Tue, Apr 7, 2026 at 12:02 PM Xuneng Zhou <xunengzhou(at)gmail(dot)com> wrote:
> >
> > Hi Andres,
> >
> > On Tue, Apr 7, 2026 at 11:31 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > >
> > > Hi,
> > >
> > > On 2026-04-06 23:07:45 -0400, Andres Freund wrote:
> > > > But, leaving that aside, looking at this code I'm somewhat concerned - it
> > > > seems to not worry at all about memory ordering?
> > > >
> > > >
> > > > static void
> > > > XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr, TimeLineID tli)
> > > > ...
> > > > /* Update shared-memory status */
> > > > pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
> > > >
> > > > /*
> > > > * If we wrote an LSN that someone was waiting for, notify the waiters.
> > > > */
> > > > if (waitLSNState &&
> > > > (LogstreamResult.Write >=
> > > > pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_WRITE])))
> > > > WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_WRITE, LogstreamResult.Write);
> > > >
> > > > There are no memory barriers here, so the CPU would be entirely free to not
> > > > make the writtenUpto write visible to a waiter that's in the process of
> > > > registering and is checking whether it needs to wait in WaitForLSN().
> > > >
> > > > And WaitForLSN()->GetCurrentLSNForWaitType()->GetWalRcvWriteRecPtr() also has
> > > > no barriers. That MAYBE is ok, due to addLSNWaiter() providing the barrier at
> > > > loop entry and maybe kinda you can think that WaitLatch() will somehow also
> > > > have barrier semantic. But if so, that would need to be very carefully
> > > > documented. And it seems completely unnecessary here, it's hard to believe
> > > > using a barrier (via pg_atomic_read_membarrier_u64() or such) would be a
> > > > performance issue.
> >
> > Thanks for pointing this out. This is indeed a store-load ordering issue.
> >
> > > And separately from the memory ordering, how can it make sense that there's
> > > at least 5 copies of this
> > >
> > > if (waitLSNState &&
> > > (LogstreamResult.Flush >=
> > > pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_FLUSH])))
> > > WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_FLUSH, LogstreamResult.Flush);
> > >
> > > around? That needs to be encapsulated so that if you have a bug, like the
> > > memory ordering problem I describe above, it can be fixed once, not in
> > > multiple places.
> >
> > Yeah, this duplication is not ok.
> >
> > > And why do these callers even have that pre-check? Seems WaitLSNWakeup()
> > > does so itself?
> > >
> > > /*
> > > * Fast path check. Skip if currentLSN is InvalidXLogRecPtr, which means
> > > * "wake all waiters" (e.g., during promotion when recovery ends).
> > > */
> > > if (XLogRecPtrIsValid(currentLSN) &&
> > > pg_atomic_read_u64(&waitLSNState->minWaitedLSN[i]) > currentLSN)
> > > return;
> > >
> > > And why is the code checking if waitLSNState is non-NULL?
> > >
> >
> > These fast checks are unnecessary copy-pastos and waitLSNState checks
> > also do not make sense except for the one in WaitLSNCleanup.
> >
> > I'll prepare a patch set addressing them.
> >
>
> Here is some analysis of the issue reported by Tom:
>
> 1) The problem
>
> WAIT FOR LSN with standby_write or standby_flush mode can block
> indefinitely on an idle primary even when the target LSN is already
> satisfied by WAL on disk.
>
> The walreceiver initializes its process-local LogstreamResult.Write
> and LogstreamResult.Flush from GetXLogReplayRecPtr() at connect time,
> reflecting all WAL already present on the standby (from a base backup,
> archive restore, or prior streaming). The shared-memory positions used
> by WAIT FOR LSN, however, are not seeded from this value:
>
> WalRcv->writtenUpto is zero-initialized by ShmemInitStruct and remains
> 0 until XLogWalRcvWrite() processes incoming streaming data.
> WalRcv->flushedUpto is initialized to the segment-aligned streaming
> start point by RequestXLogStreaming(), which may be significantly
> behind the replay position. It advances only when XLogWalRcvFlush()
> processes new data — which itself requires LogstreamResult.Flush <
> LogstreamResult.Write, a condition that never holds at startup since
> both fields are initialized to the same value.
>
> When the primary is idle and sends no new WAL, both positions stay at
> their initial stale values indefinitely.
>
> 2) The fix
> Seed writtenUpto and flushedUpto from LogstreamResult immediately
> after the walreceiver initializes those process-local fields, then
> call WaitLSNWakeup() to wake any already-blocked waiters.
>
> This broadens the semantics of these fields. writtenUpto and
> flushedUpto used to track only WAL written or flushed by the current
> walreceiver session — WAL received from the primary since the most
> recent connect. After this change, they are initialized to the replay
> position, so they also cover WAL that was already on disk before
> streaming began. This affects pg_stat_wal_receiver.written_lsn and
> flushed_lsn, which will now report the replay position immediately at
> walreceiver startup rather than 0 and the segment boundary
> respectively. I am still considering whether this semantic change is
> acceptable, though in my testing it does shorten the runtime of the
> TAP tests that Tom reported. Another approach is to modify the logic
> of GetCurrentLSNForWaitType to cope with this special case and leave
> the publisher side alone, without changing the semantics; but that
> seems more subtle.

Patch 0001 looks OK to me.
Regarding patch 0002: the changes to GetCurrentLSNForWaitType() look
reliable to me. PerformWalRecovery() sets the replayed positions
before starting recovery, and in turn before the standby can accept
connections. So the changes to WalReceiverMain() don't look necessary
to me.

------
Regards,
Alexander Korotkov
Supabase
