Re: Fix XLogFileReadAnyTLI silently applying divergent WAL from wrong timeline

From: surya poondla <suryapoondla4(at)gmail(dot)com>
To: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fix XLogFileReadAnyTLI silently applying divergent WAL from wrong timeline
Date: 2026-07-02 17:59:08
Message-ID: CAOVWO5oaVqcm54w9LmY6JSx5Mk_XwUc4iJPDO8ZZYnHLoVX5JA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Andrey,

I finally got some time to review this patch. +1 it's clearly the same bug
as the 2022 thread Ants linked, and 0001 lands where that discussion ended
up.
Since it's silent corruption on an ordinary setup and reproduces back to
v12, I think it needs a back-patch to all supported branches.

I iterated through the cases that seemed interesting, like a mid-segment
switch, a switch that lands exactly on a segment boundary, and
a three-timeline chain and the patch works as expected.
In particular the beginseg check stays a "continue", so a too-new timeline
doesn't consume the single attempt and the real owner is still tried.
The 053 test is a good catch: a fix that only guarded the target timeline's
own switch point would walk straight past that intermediate case.

A couple of comments:
1. Removing the fallback also removes the one case where it did the right
thing: a cascading failover where the child timeline promoted and archived
its
history file but died before archiving the switch-point segment. There,
only the parent's copy of that segment is available, and everything in it
before the switch point is still valid WAL and today we replay it, with
this patch we stall.
I think stalling is the correct default (a visible wait is far better than
silently applying divergent data), maybe we call this out in the commit
message.
The complete fix, stopping at the exact switch LSN instead of at segment
granularity, is a bigger change and fair to leave for later.

2. The comment "the parent's WAL is no longer valid" overstates it; the
pre-switch bytes are valid, we just can't stop mid-segment.
Worth rewording, plus a line on how found_eligible relates to the existing
"tli < curFileTLI" guard (they're not redundant).

3. The 052/053 tests only assert the negative, so they'd also pass if
recovery never reached the segment.
Could you add a positive check where the divergent rows never appear, and
in one test that recovery finishes on the right
timeline once the correct segment shows up?

4. With the fallback gone, a missing child segment now looks like a hang
with only the DEBUG2 "could not open file" trace. A
LOG/DEBUG1 line where we decline to fall back ("waiting for the owning
timeline's segment N") would make the intended wait self-explaining.

I haven't dug into 0002/0003 yet, but from a high level they look
independent of 001 patch. I'll review those soon.

Thanks for chasing this down.

Regards,
Surya Poondla

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Paul A Jungwirth 2026-07-02 18:06:45 Re: Fix RLS checks for UPDATE/DELETE FOR PORTION OF leftover rows
Previous Message Sami Imseih 2026-07-02 17:57:38 Re: Improve pg_stat_statements scalability