Startup PANIC on standby promotion due to zero-filled WAL segment

From: Alena Vinter <dlaaren8(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Startup PANIC on standby promotion due to zero-filled WAL segment
Date: 2025-12-23 07:02:15
Message-ID: CAGWv16+R5zH0orpRYHESXGdkL2HMXYjWJGR1BfpOMDGhhaZ6bg@mail.gmail.com
Lists: pgsql-hackers

Hi hackers,

During replication, when a new timeline is detected, PostgreSQL creates a
new zero-filled WAL segment on the new timeline instead of copying the
partial segment from the previous timeline. This diverges from the behavior
during timeline switches at startup.
This discrepancy can cause problems, especially when replication is slow.
Consider the following scenario:

   last record in TLI |   | timeline switch point
                      v   v
|-----TLI N---------------|0000000000000000000
|
|-----TLI N+1--00000000000|0000000000000000000

If a standby is promoted before the WAL segment containing the last record
of the previous timeline has been fully copied to the new timeline, startup
may fail. We have observed this in production, where recovery aborts with
"PANIC: invalid magic number 0000 in WAL segment ..."

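The zero magic number in that message is what one would expect from a page
that was never written. As a rough illustration (not part of the attached
patches), the Python sketch below scans the bytes of a segment and reports
every page whose first two bytes, where the page header's xlp_magic field
lives, are zero. It assumes the default 8 kB WAL page size (XLOG_BLCKSZ) and a
little-endian platform; the function name find_zero_magic_pages is made up for
this example.

```python
import struct
from typing import List

XLOG_BLCKSZ = 8192  # assumption: default WAL page size


def find_zero_magic_pages(data: bytes, page_size: int = XLOG_BLCKSZ) -> List[int]:
    """Return byte offsets of pages whose 16-bit magic field is zero."""
    offsets = []
    # Stop early enough that two bytes can always be read at each offset.
    for off in range(0, len(data) - 1, page_size):
        (magic,) = struct.unpack_from("<H", data, off)  # little-endian uint16
        if magic == 0:
            offsets.append(off)
    return offsets


if __name__ == "__main__":
    # Synthetic segment: two "written" pages followed by two zero-filled ones.
    # 0xABCD is an arbitrary nonzero stand-in, not a real XLOG_PAGE_MAGIC value.
    written = struct.pack("<H", 0xABCD) + bytes(XLOG_BLCKSZ - 2)
    segment = written * 2 + bytes(XLOG_BLCKSZ * 2)
    print(find_zero_magic_pages(segment))
```

Running something like this against a copy of the segment named in the PANIC
message could show how much of it is still zero-filled, though it says nothing
about whether the nonzero pages are otherwise valid.
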
I’ve attached:
* a patch and a TAP test that reproduce the issue;
* a draft patch that, on a timeline switch during recovery, copies the
remainder of the current WAL segment from the old timeline, matching the
behavior used after crash recovery at startup.

All existing regression tests pass with the patch applied, but I plan to
add more targeted test cases.
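
To make the difference between the two behaviors concrete, here is a toy model
in Python. It is only a sketch and not the patch itself: the tiny segment size
and the names init_zero_filled and init_from_old_timeline are invented for
illustration, and it assumes that "copying" means carrying over the old
timeline's bytes up to the switch offset and zero-filling the rest.

```python
SEG_SIZE = 64  # toy segment size, for illustration only (real default is 16 MB)


def init_zero_filled(seg_size: int = SEG_SIZE) -> bytearray:
    """Model of the replication path described above: an all-zero segment."""
    return bytearray(seg_size)


def init_from_old_timeline(old_seg: bytes, switch_off: int) -> bytearray:
    """Model of the copying path: keep old-timeline bytes up to the switch
    offset and leave everything after it zeroed."""
    if not 0 <= switch_off <= len(old_seg):
        raise ValueError("switch offset outside segment")
    new_seg = bytearray(len(old_seg))  # starts out zero-filled
    new_seg[:switch_off] = old_seg[:switch_off]
    return new_seg


if __name__ == "__main__":
    # Old-timeline segment: 40 bytes of "WAL data", then unused zero space.
    old = bytes([0xAB]) * 40 + bytes(SEG_SIZE - 40)
    zeroed = init_zero_filled()
    copied = init_from_old_timeline(old, 40)
    print(zeroed[:2] == b"\x00\x00")  # start of segment reads as zeros
    print(copied[:40] == old[:40])    # start of segment carries old data
```

In this model, a reader that starts at the beginning of the zero-filled
segment immediately sees zero bytes where a page header should be, while the
copied segment still begins with the old timeline's contents.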

I’d appreciate your feedback. In particular:
* Is this behavior (not copying the segment during replication) intentional?
* Are there edge cases I might be overlooking?

---
Best wishes,
Alena Vinter

Attachment                                 Content-Type        Size
recovery_tli_switch_bug_reproduction.diff  text/x-patch        583 bytes
recovery_tli_switch_test.pl                application/x-perl  983 bytes
v1-recovery_tli_switch_bug_fix.diff        text/x-patch        1.7 KB
