Re: t/035_standby_logical_decoding.pl might fail on attempt to read wrong timeline

From: Xuneng Zhou <xunengzhou(at)gmail(dot)com>
To: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
Cc: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Alexander Lakhin <exclusion(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: t/035_standby_logical_decoding.pl might fail on attempt to read wrong timeline
Date: 2026-06-08 13:00:17
Message-ID: CABPTF7U_B9pdC563fFONQLX_FXCZtxZgM+jN9snzMVg9b9MRfg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jun 8, 2026 at 4:31 PM Xuneng Zhou <xunengzhou(at)gmail(dot)com> wrote:
>
> Hi Kuroda-san,
>
> On Mon, Jun 8, 2026 at 12:25 PM Hayato Kuroda (Fujitsu)
> <kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
> >
> > Hi Alexander, Bertrand, Xuneng,
> >
> > Thanks for seeing the failure. Our team also recognized but could not find the reason.
> >
> > > Yeah, it looks like there is a race condition here. I think we should check if
> > > the insertion timeline has already been set (like the walsummarizer is doing).
> >
> > Sorry for stupid question; I tried to reproduce the failure but could not, see attached.
> >
> > IIUC, the issue can happen if the walsender must read the WAL record generated
> > after the promotion but the timeline could not be updated.
>
> I think the race is that the logical walsender is trying to read the
> pre-promotion wal from the slot's restart_lsn.
>
> 1) In 035_standby_logical_decoding.pl, the active slot /
> pg_recvlogical process is started before:
> INSERT rows 1..4 # before promotion
> promote standby
> INSERT rows 5..7 # after promotion
>
> 2) Alexander’s reproducer sleeps immediately after:
> ReplicationSlotAcquire(cmd->slotname, true, true);
> pg_usleep(200000);
>
> So with the delay:
> slot is active but walsender has not yet create decoding context and
> called xlogbeginread();
>
> 3) Then the test continues and promotes the standby while the
> walsender still sleeping, therefore promotion could happen before the
> walsender starts its actual wal read.
>
> 4) The logical walsender is trying to read the pre-promotion wal from
> the slot's restart_lsn. The failing log implies this:
> Streaming transactions committing after 0/06487A20,
> reading WAL from 0/064879E8.
>
> This line is emitted from:
> errdetail("Streaming transactions committing after %X/%08X, reading
> WAL from %X/%08X.",
> slot->data.confirmed_flush,
> slot->data.restart_lsn)
>
> Function StartLogicalReplication() initiates the xlogreader at the restart lsn:
> XLogBeginRead(logical_decoding_ctx->reader,
> MyReplicationSlot->data.restart_lsn);
>
> Because the slot was created and acquired before promotion, its
> restart_lsn is from the standby-era state which points to wal required
> for decoding the pre-promotion stream. In the test, the first expected
> output is rows 1..4, inserted before promotion.
>
> The failing output was empty:
> got: ''
> expected:
> BEGIN
> rows 1..4
> COMMIT
> BEGIN
> rows 5..7
> COMMIT
>
> Empty output suggests that the walsender died before emitting even the
> first pre-promotion transaction.
>
> 5) The walsender try to read wal but failed
> Failed run:
> LOG: !!!WalSndSegmentOpen| nextSegNo: 6,
> path: pg_wal/000000010000000000000006
> ERROR: requested WAL segment
> 000000010000000000000006 has already been removed
>
> Passed run
> LOG: !!!WalSndSegmentOpen| nextSegNo: 6,
> path: pg_wal/000000020000000000000006
>
> As the log suggested, the walsender of the successful run readed the
> wal from timeline 2 but failed to do so for timeline1.
> This comment in xlogutils.is almost exactly the issue:
>
> We care about timelines ... reading xlog generated prior to a promotion
> ...
> The server copied the segment to the new timeline ... there's no guarantee
> the old segment will still exist. It may have been deleted or renamed with
> a .partial suffix
>
> So the need for tli1 does not always imply that the wal was generated
> after promotion. It could be bytes from before the switchpoint, but
> they now be available through the new tli segment file.
>
> 6) XLogReadDetermineTimeline() needs the current system-wide TLI.
>
> The current code in does roughly:
>
> am_cascading_walsender = RecoveryInProgress();
>
> if (am_cascading_walsender)
> GetXLogReplayRecPtr(&currTLI);
> else
> currTLI = GetWALInsertionTimeLine();
>
> During promotion, we set the wal insertion timeline before marking
> recovery as fully done. So there is a window where:
>
> - InsertTimeLineID is already the new timeline, e.g. tli 2.
> - RecoveryInProgress() can still return true in another backend.
> - A logical walsender that started before promotion can wake up and
> choose its read timeline using the old recovery-state test.
>
> In the race, replay tli 1 is chosen even though the promotion has
> created the new tli 2 wal segment copy. Then WalSndSegmentOpen() tries
> to open 000000010000000000000006, which may already be gone or
> renamed, while the correct file is 000000020000000000000006.
>
> 7) Bertrand's proposal to fix this
> Recognizing the intermediate promotion state -- if recovery still
> shows progress but the insertion tli is already advanced, use that
> insertion tli for the timeline to be chosen.
>
> > However, I think logical_read_xlog_page() is called after the new WAL records
> > are generated, i.e., am_cascading_walsender has already been false at that time.
> > So not sure where is the race?
>
> This function is not only called after new post-promotion records are
> generated. When START_REPLICATION begins, CreateDecodingContext() /
> XLogBeginRead() starts from the slot's restart_lsn, which can be well
> before promotion.
>

I tweaked the reproducer based on the theory outlined above. The main
changes from the original reproducer are:

1) blocks at logical-walsender-after-slot-acquire in walsender.c,
before the decoding context is created and before the reader starts
from restart_lsn, matching the delay set by Alexander

2) Forces the first read to occur during promotion. It inserts rows
1..4, waits for replay, starts promotion with pg_promote(false), holds
startup at startup-logical-decoding-status-change-end-of-recovery,
then wakes the walsender.

The test failed on HEAD and passed after applying the patch by Bertrand.

--
Regards,
Xuneng Zhou
HighGo Software Co., Ltd.

Attachment Content-Type Size
v1-0001-Reproducer-for-logical-decoding-choosing-the-wron.patch application/octet-stream 5.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2026-06-08 13:01:24 Re: Fix tuple deformation with virtual generated NOT NULL columns
Previous Message Etsuro Fujita 2026-06-08 12:45:03 Re: [(known) BUG] DELETE/UPDATE more than one row in partitioned foreign table