Re: BUG: Cascading standby fails to reconnect after falling back to archive recovery

From: Xuneng Zhou <xunengzhou(at)gmail(dot)com>
To: Marco Nenciarini <marco(dot)nenciarini(at)enterprisedb(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: BUG: Cascading standby fails to reconnect after falling back to archive recovery
Date: 2026-05-01 02:57:08
Message-ID: CABPTF7XBj00sAoYRsL9=YqdaO1iNFLaqW7QNMS9gf7Ey8y7Gyw@mail.gmail.com
Lists: pgsql-hackers

Hi Marco,

On Tue, Apr 28, 2026 at 12:50 AM Marco Nenciarini <
marco(dot)nenciarini(at)enterprisedb(dot)com> wrote:

> v7 patches attached. No code changes from v6, just rebased on
> current master to remove minor offset, and the backpatch file is
> renamed with a "nocfbot-" prefix so the commitfest bot picks up
> only the master patch.
>
>
> On Mon, Apr 27, 2026 at 6:00 PM Marco Nenciarini <
> marco(dot)nenciarini(at)enterprisedb(dot)com> wrote:
>
>> Registered in PG20-1: https://commitfest.postgresql.org/patch/6716/
>>
>> On Sat, Mar 21, 2026 at 11:52 AM Marco Nenciarini <
>> marco(dot)nenciarini(at)enterprisedb(dot)com> wrote:
>>
>>> Here are the v6 patches.
>>>
>>> Xuneng correctly pointed out that RequestXLogStreaming rounds down,
>>> not up, so it isn't the cause of the gap. The actual mechanism is
>>> that archive recovery processes whole segment files: after both nodes
>>> replay the same archived segment N, the cascade's next read position
>>> lands at the start of segment N+1, while the upstream's
>>> GetStandbyFlushRecPtr returns replayPtr inside segment N.
>>>
>>> Changes from v5:
>>>
>>> - Updated the code comment and commit message to describe the correct
>>> root cause (archive recovery segment granularity, not
>>> RequestXLogStreaming truncation).
>>>
>>> - Reset the catchup state when the upstream is no longer behind.
>>> Without this, if the walreceiver successfully streams, the
>>> connection breaks, and it loops back to find itself ahead again,
>>> the stale deadline from the previous wait would cause an immediate
>>> timeout.
>>>
>>> Two patches attached: v6-0001 for master (extends the
>>> walrcv_identify_system API) and v6-backpatch-0001 for stable branches
>>> (global variable to preserve ABI).
>>>
>>
Polling at intervals still doesn't seem ideal to me, but I don't have a
better idea for now.
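
For anyone following along, the gap described in the quoted v6 notes comes
down to plain arithmetic. Below is a minimal sketch, not PostgreSQL code: it
models LSNs as integers, assumes the default 16 MB WAL segment size, and uses
made-up helper names and a made-up replay offset. It only illustrates why the
cascade's next read position ends up ahead of the position the upstream
reports after both have replayed the same archived segment N.

```python
# Illustrative only: LSNs are modelled as plain integers and the helpers
# below are hypothetical; none of them exist in PostgreSQL.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default segment size (16 MB)


def segment_number(lsn: int) -> int:
    """Return the number of the WAL segment that contains this LSN."""
    return lsn // WAL_SEGMENT_SIZE


def segment_start(segno: int) -> int:
    """Return the first LSN of the given WAL segment."""
    return segno * WAL_SEGMENT_SIZE


def cascade_is_ahead(cascade_next_read: int, upstream_flush: int) -> bool:
    """True when the cascade would ask for WAL the upstream has not reported."""
    return cascade_next_read > upstream_flush


# Both nodes have replayed archived segment N (N = 5 is an arbitrary choice).
n = 5

# Upstream: its reported position is a replay pointer somewhere inside
# segment N (the 0x2A0000 offset is made up for the example).
upstream_replay_ptr = segment_start(n) + 0x2A0000

# Cascade: archive recovery consumed the whole segment file, so its next
# read position is the start of segment N+1.
cascade_next_read = segment_start(n + 1)

# Size of the gap the cascade has to wait out before streaming can start.
gap = cascade_next_read - upstream_replay_ptr

print(f"upstream is in segment {segment_number(upstream_replay_ptr)}, "
      f"cascade wants segment {segment_number(cascade_next_read)}")
print(f"cascade ahead: {cascade_is_ahead(cascade_next_read, upstream_replay_ptr)}, "
      f"gap = {gap} bytes")
```

The gap is always smaller than one segment in this model, which is why
waiting for the upstream to catch up is a plausible remedy at all.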

--
Best,
Xuneng
