Re: BUG: Cascading standby fails to reconnect after falling back to archive recovery

From: Marco Nenciarini <marco(dot)nenciarini(at)enterprisedb(dot)com>
To: Xuneng Zhou <xunengzhou(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: BUG: Cascading standby fails to reconnect after falling back to archive recovery
Date: 2026-05-06 11:27:48
Message-ID: CA+nrD2eVHHNpKkEc=RsPkcbe033EyZqa_1YTFcSLqwCfZ9r2xA@mail.gmail.com
Lists: pgsql-hackers

Hi Xuneng,

You're right that polling isn't ideal. For a backpatchable bug fix
though, the trade-off seems reasonable: the change is contained in
the walreceiver, doesn't touch the wire protocol, and applies to all
back branches. Exploring better designs would be worthwhile but
probably belongs in a separate effort.

Best regards,
Marco

On Fri, May 1, 2026 at 4:57 AM Xuneng Zhou <xunengzhou(at)gmail(dot)com> wrote:

> Hi Marco,
>
> On Tue, Apr 28, 2026 at 12:50 AM Marco Nenciarini <
> marco(dot)nenciarini(at)enterprisedb(dot)com> wrote:
>
>> v7 patches attached. No code changes from v6, just rebased on
>> current master to remove a minor offset, and the backpatch file is
>> renamed with a "nocfbot-" prefix so the commitfest bot picks up
>> only the master patch.
>>
>>
>> On Mon, Apr 27, 2026 at 6:00 PM Marco Nenciarini <
>> marco(dot)nenciarini(at)enterprisedb(dot)com> wrote:
>>
>>> Registered in PG20-1: https://commitfest.postgresql.org/patch/6716/
>>>
>>> On Sat, Mar 21, 2026 at 11:52 AM Marco Nenciarini <
>>> marco(dot)nenciarini(at)enterprisedb(dot)com> wrote:
>>>
>>>> Here are the v6 patches.
>>>>
>>>> Xuneng correctly pointed out that RequestXLogStreaming rounds down,
>>>> not up, so it isn't the cause of the gap. The actual mechanism is
>>>> that archive recovery processes whole segment files: after both nodes
>>>> replay the same archived segment N, the cascade's next read position
>>>> lands at the start of segment N+1, while the upstream's
>>>> GetStandbyFlushRecPtr returns replayPtr inside segment N.
>>>>
>>>> Changes from v5:
>>>>
>>>> - Updated the code comment and commit message to describe the correct
>>>> root cause (archive recovery segment granularity, not
>>>> RequestXLogStreaming truncation).
>>>>
>>>> - Reset the catchup state when the upstream is no longer behind.
>>>> Without this, if the walreceiver successfully streams, the
>>>> connection breaks, and it loops back to find itself ahead again,
>>>> the stale deadline from the previous wait would cause an immediate
>>>> timeout.
>>>>
>>>> Two patches attached: v6-0001 for master (extends the
>>>> walrcv_identify_system API) and v6-backpatch-0001 for stable branches
>>>> (global variable to preserve ABI).
>>>>
>>>
> Polling at intervals still doesn't seem good to me. But I don't have a
> better idea for now.
>
> --
> Best,
> Xuneng
>
