Re: BUG: Cascading standby fails to reconnect after falling back to archive recovery

From: Xuneng Zhou <xunengzhou(at)gmail(dot)com>
To: Marco Nenciarini <marco(dot)nenciarini(at)enterprisedb(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: BUG: Cascading standby fails to reconnect after falling back to archive recovery
Date: 2026-03-18 01:51:28
Message-ID: CABPTF7WnTfBTL-OiDJqnAhm-SoyRoT+jW5qE0MfHTtv1vOaSSA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Mar 17, 2026 at 8:20 PM Xuneng Zhou <xunengzhou(at)gmail(dot)com> wrote:
>
> On Tue, Mar 17, 2026 at 7:56 PM Marco Nenciarini
> <marco(dot)nenciarini(at)enterprisedb(dot)com> wrote:
> >
> > Thanks for verifying the fix and improving the test, Xuneng.
> >
> > The wait_for_event() synchronization is a nice addition — it gives
> > deterministic proof that the walreceiver actually entered the
> > upstream-catchup path. The scoped log window with slurp_file() is
> > also cleaner than the broad log_contains() I had before.
> >

After thinking about this more, I’m less convinced that polling at
wal_retrieve_retry_interval is the right approach. If the upstream
stalls for a long time, or permanently, the walreceiver can loop
indefinitely, leaving startup effectively pinned in the streaming path
instead of switching to other WAL sources. In that case, the repeated
“ahead of flush position” log entries can also become noisy.
Conversely, if the upstream catches up quickly, the walreceiver still
won’t notice until the next poll, adding up to one full
wal_retrieve_retry_interval of unnecessary latency.
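
To put a rough number on the latency point, here is a toy model. It is
only a sketch, not PostgreSQL code: the function name notice_delay_ms
is hypothetical, and it assumes the walreceiver checks at time 0 and
then strictly once per interval, with no other wakeup source.

```python
def notice_delay_ms(interval_ms: int, catchup_ms: int) -> int:
    """Return the extra wait between the upstream catching up and the
    next fixed-interval poll that would observe it.

    Hypothetical model: polls happen at 0, interval_ms, 2*interval_ms, ...
    and the upstream catches up catchup_ms after the first poll.
    """
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    if catchup_ms < 0:
        raise ValueError("catchup_ms must be non-negative")
    # Index of the first poll at or after the catch-up time (integer ceiling).
    polls = -(-catchup_ms // interval_ms)
    return polls * interval_ms - catchup_ms


if __name__ == "__main__":
    interval = 5000  # 5s, the default wal_retrieve_retry_interval
    for catchup in (1, 2500, 5000, 5001):
        delay = notice_delay_ms(interval, catchup)
        print(f"caught up after {catchup} ms, noticed {delay} ms later")
```

Under these assumptions the added delay is always below one interval,
and approaches it when the upstream catches up just after a poll.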

--
Best,
Xuneng
