Re: BUG: Cascading standby fails to reconnect after falling back to archive recovery

From: Marco Nenciarini <marco(dot)nenciarini(at)enterprisedb(dot)com>
To: Xuneng Zhou <xunengzhou(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: BUG: Cascading standby fails to reconnect after falling back to archive recovery
Date: 2026-03-18 09:49:25
Message-ID: CA+nrD2eJUfLq8_Ed7fv-7LrmkOoLJ28LwAHh-Rjjg4RU9KOYCg@mail.gmail.com
Lists: pgsql-hackers

Here are the v4 patches implementing what I described above.

On top of Xuneng's v3 (keeping the wait_for_event and scoped log
window test improvements), the main changes are:

- The wait is now capped at one wal_segment_size. If the gap between
the requested start position and the upstream's flush position is
larger than that, we skip the wait and let START_REPLICATION fail
normally so the startup process can fall back to archive recovery.
This avoids indefinite polling when the upstream is far behind.

- The first "ahead of flush position" message is logged at LOG,
subsequent ones at DEBUG1, to cut down on noise during a long wait.
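
To make the two items above concrete, here is a rough standalone sketch of the decision logic. It is not the code from the patches: every name in it (sketch_decide_wait, sketch_wait_log_level, the enums) is hypothetical, XLogRecPtr is only a stand-in typedef for the PostgreSQL type of the same name, and treating a gap of exactly one segment as still worth waiting for is an assumption about the boundary.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for PostgreSQL's XLogRecPtr (a 64-bit WAL position). */
typedef uint64_t XLogRecPtr;

/* Hypothetical outcomes; the patch may structure this differently. */
typedef enum
{
    SKETCH_START_NOW,   /* upstream has already flushed past our start point */
    SKETCH_WAIT,        /* upstream is behind, but by at most one segment */
    SKETCH_SKIP_WAIT    /* gap too large: let START_REPLICATION fail */
} SketchWaitDecision;

/* Hypothetical stand-ins for the LOG and DEBUG1 elog levels. */
typedef enum
{
    SKETCH_DEBUG1,
    SKETCH_LOG
} SketchLogLevel;

/*
 * Decide whether to wait for the upstream to catch up.
 *
 * start_lsn          position the standby wants to stream from
 * upstream_flush_lsn flush position reported by the upstream
 * wal_segment_size   segment size in bytes, used as the cap on the gap
 */
static SketchWaitDecision
sketch_decide_wait(XLogRecPtr start_lsn,
                   XLogRecPtr upstream_flush_lsn,
                   uint64_t wal_segment_size)
{
    if (start_lsn <= upstream_flush_lsn)
        return SKETCH_START_NOW;

    /* Safe subtraction: start_lsn > upstream_flush_lsn at this point. */
    if (start_lsn - upstream_flush_lsn > wal_segment_size)
        return SKETCH_SKIP_WAIT;

    return SKETCH_WAIT;
}

/*
 * Pick the level for the "ahead of flush position" message: LOG the
 * first time during a wait, DEBUG1 on every later call.
 */
static SketchLogLevel
sketch_wait_log_level(bool *already_logged)
{
    if (!*already_logged)
    {
        *already_logged = true;
        return SKETCH_LOG;
    }
    return SKETCH_DEBUG1;
}
```

With a 16 MB segment size, a standby that is 1 MB ahead of the upstream's flush position would wait, while one that is 32 MB ahead would skip the wait and go straight to the normal START_REPLICATION failure path.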

Two patches attached: v4-0001 for master (extends the
walrcv_identify_system API with an optional server_lsn output
parameter) and v4-backpatch-0001 for stable branches (uses a global
variable to preserve ABI, per Alvaro's suggestion).

Both pass the new TAP test.

Best regards,
Marco

Attachment Content-Type Size
v4-backpatch-0001-Fix-cascading-standby-reconnect-failure-after-arc.patch text/x-patch 14.8 KB
v4-0001-Fix-cascading-standby-reconnect-failure-after-arc.patch text/x-patch 16.6 KB
