Re: BUG: Cascading standby fails to reconnect after falling back to archive recovery

From: Marco Nenciarini <marco(dot)nenciarini(at)enterprisedb(dot)com>
To: Xuneng Zhou <xunengzhou(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: BUG: Cascading standby fails to reconnect after falling back to archive recovery
Date: 2026-03-21 10:52:28
Message-ID: CA+nrD2ctWvVCkxNDDEO0C3SUposCKV_k0AL5duxsugA+-SS8hA@mail.gmail.com
Lists: pgsql-hackers

Here are the v6 patches.

Xuneng correctly pointed out that RequestXLogStreaming rounds down,
not up, so it isn't the cause of the gap. The actual mechanism is
that archive recovery processes whole segment files: after both nodes
replay the same archived segment N, the cascade's next read position
lands at the start of segment N+1, while the upstream's
GetStandbyFlushRecPtr returns replayPtr inside segment N.
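
To make the gap concrete, here is a small arithmetic sketch. It is not code from the patch. WalPos, seg_of, next_seg_start and cascade_lead are hypothetical names, WalPos only stands in for XLogRecPtr, and the 16 MB segment size is the default rather than something this report states. It also assumes the simplified model described above: the cascade resumes at the first byte of the segment after the one it last replayed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for XLogRecPtr: a 64-bit byte position in the WAL. */
typedef uint64_t WalPos;

/* Assumed segment size: the 16 MB default. Clusters can be built differently. */
#define WAL_SEG_SIZE ((WalPos) 16 * 1024 * 1024)

/* Number of the segment that contains pos. */
static WalPos
seg_of(WalPos pos, WalPos seg_size)
{
    assert(seg_size > 0);
    return pos / seg_size;
}

/* First byte of the segment following the one that contains pos. */
static WalPos
next_seg_start(WalPos pos, WalPos seg_size)
{
    return (seg_of(pos, seg_size) + 1) * seg_size;
}

/*
 * Bytes by which the cascade's requested start position exceeds the
 * upstream's flush position; 0 when the cascade is not ahead.
 */
static WalPos
cascade_lead(WalPos cascade_start, WalPos upstream_flush)
{
    return cascade_start > upstream_flush ? cascade_start - upstream_flush : 0;
}
```

With an illustrative replayPtr of 0x5A00000 (inside segment 5), next_seg_start gives 0x6000000, so cascade_lead(0x6000000, 0x5A00000) is 0x600000: the cascade asks for WAL 6 MB past what the upstream reports as flushed, even though both replayed the same archived segment.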

Changes from v5:

- Updated the code comment and commit message to describe the correct
root cause (archive recovery segment granularity, not
RequestXLogStreaming truncation).

- Reset the catchup state when the upstream is no longer behind.
Without this, if the walreceiver streams successfully, the
connection then breaks, and the walreceiver loops back to find
itself ahead of the upstream again, the stale deadline from the
previous wait would cause an immediate timeout.
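
The stale-deadline problem in the second item can be illustrated with a toy state machine. This is only a sketch under assumptions, not the patch itself: CatchupState, CatchupResult and catchup_check are hypothetical names, positions and times are plain integers, and the reset_when_caught_up flag exists solely to compare behavior with and without the reset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { CATCHUP_PROCEED, CATCHUP_WAIT, CATCHUP_TIMED_OUT } CatchupResult;

/* Hypothetical wait bookkeeping kept across reconnect attempts. */
typedef struct
{
    bool    waiting;      /* is a wait for the upstream in progress? */
    int64_t deadline_ms;  /* meaningful only while waiting is true */
} CatchupState;

static CatchupResult
catchup_check(CatchupState *st, uint64_t start_pos, uint64_t upstream_flush,
              int64_t now_ms, int64_t timeout_ms, bool reset_when_caught_up)
{
    assert(st != NULL);

    if (upstream_flush >= start_pos)
    {
        /* Upstream is no longer behind: streaming can start. */
        if (reset_when_caught_up)
            st->waiting = false;    /* forget the old deadline */
        return CATCHUP_PROCEED;
    }

    /* Upstream is behind: arm the deadline only when a new wait begins. */
    if (!st->waiting)
    {
        st->waiting = true;
        st->deadline_ms = now_ms + timeout_ms;
        return CATCHUP_WAIT;
    }

    /* A wait is already in progress: compare against its deadline. */
    return now_ms >= st->deadline_ms ? CATCHUP_TIMED_OUT : CATCHUP_WAIT;
}
```

Take a sequence of "behind at t=0, caught up at t=500, behind again at t=5000" with a 1000 ms timeout. With the reset, the third call starts a fresh wait with a deadline of 6000. Without it, the third call compares t=5000 against the old deadline of 1000 and reports a timeout immediately.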

Two patches attached: v6-0001 for master (extends the
walrcv_identify_system API) and v6-backpatch-0001 for stable branches
(global variable to preserve ABI).

Best regards,
Marco

Attachments:
v6-backpatch-0001-Fix-cascading-standby-reconnect-failure-after-arc.patch (text/x-patch, 16.5 KB)
v6-0001-Fix-cascading-standby-reconnect-failure-after-arc.patch (text/x-patch, 18.3 KB)