| From: | Marco Nenciarini <marco(dot)nenciarini(at)enterprisedb(dot)com> |
|---|---|
| To: | Xuneng Zhou <xunengzhou(at)gmail(dot)com> |
| Cc: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org |
| Subject: | Re: BUG: Cascading standby fails to reconnect after falling back to archive recovery |
| Date: | 2026-03-16 21:49:44 |
| Message-ID: | CA+nrD2dRNzWAxc227uqy5tdFEk-UmK7R5965GYL9yzLzP+g6+Q@mail.gmail.com |
| Lists: | pgsql-hackers |

Attached is a v2 patch that implements the "handshake clamp" approach
Xuneng suggested. Rather than tracking lastStreamedFlush in
process-local state (which doesn't survive a cascade restart, as
Fujii-san demonstrated), it uses the WAL flush position already
returned by IDENTIFY_SYSTEM.

The walreceiver now checks the upstream's flush position before issuing
START_REPLICATION. If the requested startpoint is ahead of that position
and both are on the same timeline, it waits for
wal_retrieve_retry_interval and retries. This works across restarts
because it queries the upstream's live position on every connection
attempt, and it requires no new state variables.

When timelines differ, we let START_REPLICATION handle the timeline
negotiation as before.
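
For readers skimming the thread, the decision described above can be sketched roughly as follows. This is an illustration only, not code from the attached patch: the function name handshake_clamp, the ClampDecision enum, and the local typedefs are hypothetical stand-ins, and the sketch assumes the upstream's flush position and timeline have already been read from the IDENTIFY_SYSTEM result.

```c
#include <stdint.h>

/* Local stand-ins for PostgreSQL's 64-bit WAL position and 32-bit timeline ID. */
typedef uint64_t XLogRecPtr;
typedef uint32_t TimeLineID;

/* Hypothetical result type: what the walreceiver should do next. */
typedef enum
{
    CLAMP_START_STREAMING,  /* go ahead and issue START_REPLICATION */
    CLAMP_WAIT_AND_RETRY    /* sleep for wal_retrieve_retry_interval, then reconnect */
} ClampDecision;

/*
 * Hypothetical helper: compare the requested startpoint against the
 * upstream's flush position as reported by IDENTIFY_SYSTEM.
 */
ClampDecision
handshake_clamp(XLogRecPtr startpoint, TimeLineID start_tli,
                XLogRecPtr upstream_flush, TimeLineID upstream_tli)
{
    /* Different timelines: leave the negotiation to START_REPLICATION. */
    if (start_tli != upstream_tli)
        return CLAMP_START_STREAMING;

    /* Same timeline, but the upstream has not flushed that far yet. */
    if (startpoint > upstream_flush)
        return CLAMP_WAIT_AND_RETRY;

    /* The upstream already has the requested WAL. */
    return CLAMP_START_STREAMING;
}
```

Because the comparison uses only values fetched during the current handshake, nothing has to be remembered between connection attempts, which is why the approach holds up across a restart.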

The patch includes a TAP test (053_cascade_reconnect.pl) that
reproduces the scenario and verifies the fix.

| Attachment | Content-Type | Size |
|---|---|---|
| v2-0001-Fix-cascading-standby-reconnect-failure-after-arc.patch | text/x-patch | 15.0 KB |