| From: | Xuneng Zhou <xunengzhou(at)gmail(dot)com> |
|---|---|
| To: | "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com> |
| Cc: | Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Alexander Lakhin <exclusion(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: t/035_standby_logical_decoding.pl might fail on attempt to read wrong timeline |
| Date: | 2026-06-08 08:31:39 |
| Message-ID: | CABPTF7WmLKOJcSPod7zT2hynUFQcUs0VoQyR-p=XFsafvmGm7g@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi Kuroda-san,
On Mon, Jun 8, 2026 at 12:25 PM Hayato Kuroda (Fujitsu)
<kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
>
> Hi Alexander, Bertrand, Xuneng,
>
> Thanks for seeing the failure. Our team also recognized but could not find the reason.
>
> > Yeah, it looks like there is a race condition here. I think we should check if
> > the insertion timeline has already been set (like the walsummarizer is doing).
>
> Sorry for stupid question; I tried to reproduce the failure but could not, see attached.
>
> IIUC, the issue can happen if the walsender must read the WAL record generated
> after the promotion but the timeline could not be updated.
I think the race is that the logical walsender is trying to read the
pre-promotion wal from the slot's restart_lsn.
1) In 035_standby_logical_decoding.pl, the active slot /
pg_recvlogical process is started before:
INSERT rows 1..4 # before promotion
promote standby
INSERT rows 5..7 # after promotion
2) Alexander’s reproducer sleeps immediately after:
ReplicationSlotAcquire(cmd->slotname, true, true);
pg_usleep(200000);
So with the delay:
slot is active but walsender has not yet create decoding context and
called xlogbeginread();
3) Then the test continues and promotes the standby while the
walsender still sleeping, therefore promotion could happen before the
walsender starts its actual wal read.
4) The logical walsender is trying to read the pre-promotion wal from
the slot's restart_lsn. The failing log implies this:
Streaming transactions committing after 0/06487A20,
reading WAL from 0/064879E8.
This line is emitted from:
errdetail("Streaming transactions committing after %X/%08X, reading
WAL from %X/%08X.",
slot->data.confirmed_flush,
slot->data.restart_lsn)
Function StartLogicalReplication() initiates the xlogreader at the restart lsn:
XLogBeginRead(logical_decoding_ctx->reader,
MyReplicationSlot->data.restart_lsn);
Because the slot was created and acquired before promotion, its
restart_lsn is from the standby-era state which points to wal required
for decoding the pre-promotion stream. In the test, the first expected
output is rows 1..4, inserted before promotion.
The failing output was empty:
got: ''
expected:
BEGIN
rows 1..4
COMMIT
BEGIN
rows 5..7
COMMIT
Empty output suggests that the walsender died before emitting even the
first pre-promotion transaction.
5) The walsender try to read wal but failed
Failed run:
LOG: !!!WalSndSegmentOpen| nextSegNo: 6,
path: pg_wal/000000010000000000000006
ERROR: requested WAL segment
000000010000000000000006 has already been removed
Passed run
LOG: !!!WalSndSegmentOpen| nextSegNo: 6,
path: pg_wal/000000020000000000000006
As the log suggested, the walsender of the successful run readed the
wal from timeline 2 but failed to do so for timeline1.
This comment in xlogutils.is almost exactly the issue:
We care about timelines ... reading xlog generated prior to a promotion
...
The server copied the segment to the new timeline ... there's no guarantee
the old segment will still exist. It may have been deleted or renamed with
a .partial suffix
So the need for tli1 does not always imply that the wal was generated
after promotion. It could be bytes from before the switchpoint, but
they now be available through the new tli segment file.
6) XLogReadDetermineTimeline() needs the current system-wide TLI.
The current code in does roughly:
am_cascading_walsender = RecoveryInProgress();
if (am_cascading_walsender)
GetXLogReplayRecPtr(&currTLI);
else
currTLI = GetWALInsertionTimeLine();
During promotion, we set the wal insertion timeline before marking
recovery as fully done. So there is a window where:
- InsertTimeLineID is already the new timeline, e.g. tli 2.
- RecoveryInProgress() can still return true in another backend.
- A logical walsender that started before promotion can wake up and
choose its read timeline using the old recovery-state test.
In the race, replay tli 1 is chosen even though the promotion has
created the new tli 2 wal segment copy. Then WalSndSegmentOpen() tries
to open 000000010000000000000006, which may already be gone or
renamed, while the correct file is 000000020000000000000006.
7) Bertrand's proposal to fix this
Recognizing the intermediate promotion state -- if recovery still
shows progress but the insertion tli is already advanced, use that
insertion tli for the timeline to be chosen.
> However, I think logical_read_xlog_page() is called after the new WAL records
> are generated, i.e., am_cascading_walsender has already been false at that time.
> So not sure where is the race?
This function is not only called after new post-promotion records are
generated. When START_REPLICATION begins, CreateDecodingContext() /
XLogBeginRead() starts from the slot's restart_lsn, which can be well
before promotion.
--
Regards,
Xuneng Zhou
HighGo Software Co., Ltd.
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Fujii Masao | 2026-06-08 08:32:18 | Re: ECPG: inconsistent behavior with the document in “GET/SET DESCRIPTOR.” |
| Previous Message | jian he | 2026-06-08 08:20:38 | Re: Fix domain fast defaults on empty tables |