| From: | Xuneng Zhou <xunengzhou(at)gmail(dot)com> |
|---|---|
| To: | Andres Freund <andres(at)anarazel(dot)de> |
| Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Álvaro Herrera <alvherre(at)kurilemu(dot)de>, Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, jian he <jian(dot)universality(at)gmail(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru> |
| Subject: | Re: Implement waiting for wal lsn replay: reloaded |
| Date: | 2026-04-07 12:59:58 |
| Message-ID: | CABPTF7WAcV_uEaJa3jmYPc26rApa64hV5HmjRBN=OS17MAybUg@mail.gmail.com |
| Lists: | pgsql-hackers |
On Tue, Apr 7, 2026 at 12:02 PM Xuneng Zhou <xunengzhou(at)gmail(dot)com> wrote:
>
> Hi Andres,
>
> On Tue, Apr 7, 2026 at 11:31 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> >
> > Hi,
> >
> > On 2026-04-06 23:07:45 -0400, Andres Freund wrote:
> > > But, leaving that aside, looking at this code I'm somewhat concerned - it
> > > seems to not worry at all about memory ordering?
> > >
> > >
> > > static void
> > > XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr, TimeLineID tli)
> > > ...
> > > /* Update shared-memory status */
> > > pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
> > >
> > > /*
> > > * If we wrote an LSN that someone was waiting for, notify the waiters.
> > > */
> > > if (waitLSNState &&
> > > (LogstreamResult.Write >=
> > > pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_WRITE])))
> > > WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_WRITE, LogstreamResult.Write);
> > >
> > > There are no memory barriers here, so the CPU would be entirely free to not
> > > make the writtenUpto write visible to a waiter that's in the process of
> > > registering and is checking whether it needs to wait in WaitForLSN().
> > >
> > > And WaitForLSN()->GetCurrentLSNForWaitType()->GetWalRcvWriteRecPtr() also has
> > > no barriers. That MAYBE is ok, due to addLSNWaiter() providing the barrier at
> > > loop entry and maybe kinda you can think that WaitLatch() will somehow also
> > > have barrier semantic. But if so, that would need to be very carefully
> > > documented. And it seems completely unnecessary here, it's hard to believe
> > > using a barrier (via pg_atomic_read_membarrier_u64() or such) would be a
> > > performance issue
>
> Thanks for pointing this out. This is indeed a store-load ordering issue.
>
> > And separately from the memory ordering, how can it make sense that there's
> > at least 5 copies of this
> >
> > if (waitLSNState &&
> > (LogstreamResult.Flush >=
> > pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_FLUSH])))
> > WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_FLUSH, LogstreamResult.Flush);
> >
> > around? That needs to be encapsulated so that if you have a bug, like the
> > memory ordering problem I describe above, it can be fixed once, not in
> > multiple places.
>
> Yeah, this duplication is not ok.
>
> > And why do these callers even have that pre-check? Seems WaitLSNWakeup()
> > does so itself?
> >
> > /*
> > * Fast path check. Skip if currentLSN is InvalidXLogRecPtr, which means
> > * "wake all waiters" (e.g., during promotion when recovery ends).
> > */
> > if (XLogRecPtrIsValid(currentLSN) &&
> > pg_atomic_read_u64(&waitLSNState->minWaitedLSN[i]) > currentLSN)
> > return;
> >
> > And why is the code checking if waitLSNState is non-NULL?
> >
>
> These fast-path checks are unnecessary copy-paste, and the waitLSNState
> NULL checks also do not make sense, except for the one in WaitLSNCleanup().
>
> I'll prepare a patch set addressing them.
>
Here is some analysis of the issue reported by Tom:

1) The problem

WAIT FOR LSN with standby_write or standby_flush mode can block
indefinitely on an idle primary, even when the target LSN is already
satisfied by WAL on disk.

The walreceiver initializes its process-local LogstreamResult.Write
and LogstreamResult.Flush from GetXLogReplayRecPtr() at connect time,
reflecting all WAL already present on the standby (from a base backup,
archive restore, or prior streaming). The shared-memory positions used
by WAIT FOR LSN, however, are not seeded from this value:

- WalRcv->writtenUpto is zero-initialized by ShmemInitStruct() and
  remains 0 until XLogWalRcvWrite() processes incoming streaming data.

- WalRcv->flushedUpto is initialized to the segment-aligned streaming
  start point by RequestXLogStreaming(), which may be significantly
  behind the replay position. It advances only when XLogWalRcvFlush()
  processes new data, which itself requires LogstreamResult.Flush <
  LogstreamResult.Write -- a condition that never holds at startup,
  since both fields are initialized to the same value.

When the primary is idle and sends no new WAL, both positions stay at
their initial stale values indefinitely.
2) The fix

Seed writtenUpto and flushedUpto from LogstreamResult immediately
after the walreceiver initializes those process-local fields, then
call WaitLSNWakeup() to wake any already-blocked waiters.
This broadens the semantics of these fields: writtenUpto and
flushedUpto used to track only WAL written or flushed by the current
walreceiver session, i.e. WAL received from the primary since the most
recent connect. After this change they are initialized to the replay
position, so they also cover WAL that was already on disk before
streaming began. This affects pg_stat_wal_receiver.written_lsn and
flushed_lsn, which will now report the replay position immediately at
walreceiver startup rather than 0 and the segment boundary,
respectively.

I am still considering whether this semantic change is acceptable,
though in my testing it does shorten the runtime of the TAP tests Tom
reported. An alternative approach is to teach GetCurrentLSNForWaitType()
to cope with this special case and leave the publishing (walreceiver)
side alone, without changing the semantics; but that seems more
subtle.
--
Best,
Xuneng
| Attachment | Content-Type | Size |
|---|---|---|
| v1-0001-Remove-redundant-WAIT-FOR-LSN-caller-side-pre-che.patch | application/octet-stream | 5.0 KB |
| v1-0002-Fix-WAIT-FOR-LSN-standby_write-standby_flush-hang.patch | application/octet-stream | 3.3 KB |