| From: | Xuneng Zhou <xunengzhou(at)gmail(dot)com> |
|---|---|
| To: | Marco Nenciarini <marco(dot)nenciarini(at)enterprisedb(dot)com> |
| Cc: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org |
| Subject: | Re: BUG: Cascading standby fails to reconnect after falling back to archive recovery |
| Date: | 2026-03-17 01:04:16 |
| Message-ID: | CABPTF7UEudN4OAifnORwX3A0OSeZaAA5i0xDRTj97NCuiQMCyg@mail.gmail.com |
| Lists: | pgsql-hackers |
Hi,
Thanks for the patch.
On Tue, Mar 17, 2026 at 5:49 AM Marco Nenciarini
<marco(dot)nenciarini(at)enterprisedb(dot)com> wrote:
>
> Attached is a v2 patch that implements the "handshake clamp" approach
> Xuneng suggested. Rather than tracking lastStreamedFlush in
> process-local state (which doesn't survive a cascade restart, as
> Fujii-san demonstrated), it uses the WAL flush position already
> returned by IDENTIFY_SYSTEM.
>
> The walreceiver now checks the upstream's flush position before issuing
> START_REPLICATION. If the requested startpoint is ahead (on the same
> timeline), it waits for wal_retrieve_retry_interval and retries. This
> works across restarts since it queries the upstream's live position on
> every connection attempt, and requires no new state variables.
>
> When timelines differ, we let START_REPLICATION handle the timeline
> negotiation as before.
>
> The patch includes a TAP test (053_cascade_reconnect.pl) that
> reproduces the scenario and verifies the fix.
>
I haven’t looked into it in detail yet, but it looks good overall.
I’ll test it further and verify that the issue has been resolved.
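
For readers following the thread, the check described in the quoted text
can be sketched roughly as below. This is a standalone illustration under
stated assumptions, not code from the v2 patch: the type names, the enum,
and handshake_clamp() are all hypothetical, and the real walreceiver
would take the upstream timeline and flush position from the
IDENTIFY_SYSTEM result rather than from function arguments.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for PostgreSQL's XLogRecPtr and TimeLineID. */
typedef uint64_t WalPos;
typedef uint32_t Timeline;

/* Hypothetical result type; the patch may express this differently. */
typedef enum
{
    CLAMP_START_STREAMING,  /* go ahead and issue START_REPLICATION */
    CLAMP_WAIT_AND_RETRY    /* sleep wal_retrieve_retry_interval, reconnect */
} ClampDecision;

/*
 * Decide whether to start streaming, given the requested start point and
 * the timeline / flush position the upstream reported on this connection.
 */
static ClampDecision
handshake_clamp(Timeline start_tli, WalPos start_pos,
                Timeline upstream_tli, WalPos upstream_flush)
{
    /* Timelines differ: leave negotiation to START_REPLICATION. */
    if (start_tli != upstream_tli)
        return CLAMP_START_STREAMING;

    /* Same timeline, but we are asking for WAL the upstream lacks. */
    if (start_pos > upstream_flush)
        return CLAMP_WAIT_AND_RETRY;

    return CLAMP_START_STREAMING;
}
```

Because the decision depends only on values fetched on each connection
attempt, nothing has to be remembered across a restart, which matches the
motivation given for dropping lastStreamedFlush.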
--
Best,
Xuneng