| From: | Xuneng Zhou <xunengzhou(at)gmail(dot)com> |
|---|---|
| To: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
| Cc: | Marco Nenciarini <marco(dot)nenciarini(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org |
| Subject: | Re: BUG: Cascading standby fails to reconnect after falling back to archive recovery |
| Date: | 2026-02-02 02:16:56 |
| Message-ID: | CABPTF7XTCHjZROh6jTMGDiiLJcNxx5wO=KMqGpMsDQkT4hTUmA@mail.gmail.com |
| Lists: | pgsql-hackers |
Hi,
On Fri, Jan 30, 2026 at 11:12 AM Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>
> On Thu, Jan 29, 2026 at 9:22 PM Xuneng Zhou <xunengzhou(at)gmail(dot)com> wrote:
> > Thanks for your report. I can reliably reproduce the issue on HEAD
> > using your scripts. I’ve analyzed the problem and am proposing a patch
> > to fix it.
> >
> > --- Analysis
> > When a cascading standby streams from an archive-only upstream:
> >
> > 1. The upstream's GetStandbyFlushRecPtr() returns only replay position
> > (no received-but-not-replayed buffer since there's no walreceiver)
> > 2. When streaming ends and the cascade falls back to archive recovery,
> > it can restore WAL segments from its own archive access
> > 3. The cascade's read position (RecPtr) advances beyond what the
> > upstream has replayed
> > 4. On reconnect, the cascade requests streaming from RecPtr, which the
> > upstream rejects as "ahead of flush position"
> >
> > --- Proposed Fix
> >
> > Track the last confirmed flush position from streaming
> > (lastStreamedFlush) and clamp the streaming start request when it
> > exceeds that position:
>
> I haven't read the patch yet, but doesn't lastStreamedFlush represent
> the same LSN as tliRecPtr or replayLSN (the arguments to
> WaitForWALToBecomeAvailable())? If so, we may not need to introduce
> a new variable to track this LSN.
lastStreamedFlush is the upstream's confirmed flush point from the
last streaming session, i.e. what the sender guaranteed it had.
tliRecPtr is the LSN of the start of the current WAL record, which is
used to determine which timeline that record belongs to (via
tliOfPointInHistory()), and replayLSN is how far we have applied
locally. After falling back to archive recovery, both tliRecPtr and
replayLSN can be ahead of what the upstream has, so neither can safely
cap a reconnect. lastStreamedFlush is used as the upstream-capability
bound instead.
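
To make the idea concrete, here is a minimal sketch of the clamping
logic in the v1 proposal. The variable name lastStreamedFlush comes
from the proposal above; the helper name ClampStreamingStart() is made
up for illustration, and Min(), InvalidXLogRecPtr and
XLogRecPtrIsInvalid() are existing PostgreSQL macros.

--------------------------------------------
/* Flush LSN the upstream confirmed during the last streaming session. */
static XLogRecPtr lastStreamedFlush = InvalidXLogRecPtr;

static XLogRecPtr
ClampStreamingStart(XLogRecPtr requested)
{
	/* Nothing to clamp against until we have streamed at least once. */
	if (XLogRecPtrIsInvalid(lastStreamedFlush))
		return requested;

	/* Never ask the upstream for WAL beyond what it confirmed it had. */
	return Min(requested, lastStreamedFlush);
}
--------------------------------------------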
> The choice of which LSN is used as the replication start point has varied
> over time to handle corner cases (for example, commit 06687198018).
> That makes me wonder whether we should first better understand
> why WaitForWALToBecomeAvailable() currently uses RecPtr as
> the starting point.
AFAICS, commit 06687198018 addresses a scenario where a standby gets
stuck reading a continuation record that spans multiple pages/segments
when the pages must come from different sources.
The problem: if the first page is read successfully from local pg_wal
but the second page contains garbage from a recycled segment, the old
code would enter an infinite loop. This happened for two reasons:

1. Late failure detection: page header validation occurred inside
XLogReadRecord(), which triggered ReadRecord()'s retry-from-beginning
logic, restarting the entire record read from local sources without
ever trying streaming.

2. Wrong streaming start position: even if streaming was eventually
attempted, it started from tliRecPtr (record start) rather than RecPtr
(current read position), potentially re-requesting segments the
primary had already recycled.
The fix has two parts:

1. Early page header validation: validate the page header immediately
after reading, before returning to the caller. If garbage is detected
(typically via an xlp_pageaddr mismatch), jump directly to
next_record_is_invalid to try an alternative source (streaming),
bypassing ReadRecord()'s retry loop.

2. Correct streaming start position: change ptr = tliRecPtr to ptr =
RecPtr, so streaming begins at the position where data is actually
needed. The record start position (tliRecPtr) is still used for
timeline determination, but no longer for the streaming start LSN.
Together, these changes ensure the standby escapes the local-read
retry loop and fetches the continuation data from the correct position
via streaming.
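
For reference, after that commit the relevant branch of
WaitForWALToBecomeAvailable() looks roughly like the following
(paraphrased and abbreviated, not a verbatim copy of the tree):

--------------------------------------------
/* Failed to read from archive/pg_wal; try streaming instead. */
if (PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0)
{
	XLogRecPtr	ptr;
	TimeLineID	tli;

	if (fetching_ckpt)
	{
		ptr = RedoStartLSN;
		tli = RedoStartTLI;
	}
	else
	{
		/* Stream from the current read position, not the record start. */
		ptr = RecPtr;			/* before the commit this was tliRecPtr */
		/* The record start still decides which timeline to request. */
		tli = tliOfPointInHistory(tliRecPtr, expectedTLEs);
	}

	RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
						 PrimarySlotName, wal_receiver_create_temp_slot);
}
--------------------------------------------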
> BTW, with v1 patch, I was able to reproduce the issue using the following steps:
>
> --------------------------------------------
> initdb -D data
> mkdir arch
> cat <<EOF >> data/postgresql.conf
> archive_mode = on
> archive_command = 'cp %p ../arch/%f'
> restore_command = 'cp ../arch/%f %p'
> EOF
> pg_ctl -D data start
> pg_basebackup -D sby1 -c fast
> cp -a sby1 sby2
> cat <<EOF >> sby1/postgresql.conf
> port = 5433
> EOF
> touch sby1/standby.signal
> pg_ctl -D sby1 start
> cat <<EOF >> sby2/postgresql.conf
> port = 5434
> primary_conninfo = 'port=5433'
> EOF
> touch sby2/standby.signal
> pg_ctl -D sby2 start
> pgbench -i -s2
> pg_ctl -D sby2 restart
> --------------------------------------------
>
> In this case, after restarting the standby connecting to another
> (cascading) standby, I observed the following error.
>
> FATAL: could not receive data from WAL stream: ERROR: requested
> starting point 0/04000000 is ahead of the WAL flush position of this
> server 0/03FFE8D0
>
After sby2 restarts, its WAL read position (RecPtr) is set to the
segment boundary 0/04000000, but the upstream sby1 (an archive-only
standby with no walreceiver) can only serve up to its replay position
0/03FFE8D0, so the cascade requests WAL ahead of what the upstream can
provide. The underlying issue is that no in-memory state survives the
restart to cap the streaming start request: before the restart, the
walreceiver knew what the upstream had confirmed; after the restart,
that information is lost.
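
For context, that error comes from the upstream walsender's
start-point check, which on a cascading standby is bounded by
GetStandbyFlushRecPtr(). Roughly (paraphrased from StartReplication()
in walsender.c, timeline handling elided):

--------------------------------------------
/* How far can this server serve WAL from? */
if (am_cascading_walsender)
	FlushPtr = GetStandbyFlushRecPtr(&FlushTLI);	/* only the replay
													 * position on an
													 * archive-only standby */
else
	FlushPtr = GetFlushRecPtr(&FlushTLI);

if (FlushPtr < cmd->startpoint)
	ereport(ERROR,
			(errmsg("requested starting point %X/%X is ahead of the WAL flush position of this server %X/%X",
					LSN_FORMAT_ARGS(cmd->startpoint),
					LSN_FORMAT_ARGS(FlushPtr))));
--------------------------------------------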
One potential solution is a "handshake clamp": after connecting,
obtain the upstream's current flush LSN from IDENTIFY_SYSTEM and clamp
the streaming start position to Min(startpoint, primaryFlush) before
sending START_REPLICATION. But I think this is somewhat complicated.
--
Best,
Xuneng