Re: Add WALRCV_CONNECTING state to walreceiver

From: Xuneng Zhou <xunengzhou(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Add WALRCV_CONNECTING state to walreceiver
Date: 2026-01-21 12:40:16
Message-ID: CABPTF7Wk+hSpdE1crJ6yiB=WLJvYducJjY9ZKJ8a0VdW78ZuKg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Mon, Jan 19, 2026 at 11:35 PM Xuneng Zhou <xunengzhou(at)gmail(dot)com> wrote:
>
> Hi Michael,
>
> On Mon, Jan 19, 2026 at 8:13 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> >
> > On Sun, Jan 11, 2026 at 08:56:57PM +0800, Xuneng Zhou wrote:
> > > After some thoughts, I’m more inclined toward a startup-process–driven
> > > approach. Marking the status as streaming immediately after the
> > > connection is established seems not provide sufficient accuracy for
> > > monitoring purposes. Introducing an intermediate state, such as
> > > connected, would help reduce confusion when the startup process is
> > > stalled and would make it easier for users to detect and diagnose
> > > anomalies.
> > >
> > > V4 whitelisted CONNECTED and CONNECTING in WalRcvWaitForStartPosition
> > > to handle valid stream termination scenarios without triggering a
> > > FATAL error.
> > >
> > > Specifically, the walreceiver may need to transition to WAITING (idle) if:
> > > 1. 'CONNECTED': The handshake succeeded (COPY_BOTH started), but
> > > the stream ended before any WAL was applied (e.g., timeline divergence
> > > detected mid-stream).
> > > 2. 'CONNECTING': The handshake completed (START_REPLICATION
> > > acknowledged), but the primary declined to stream (e.g., no WAL
> > > available on the requested timeline).
> > >
> > > In both cases, the receiver should pause and await a new timeline or
> > > restart position from the startup process.
> >
> > This stuff depends on the philosophical difference you want to put
> > behind "connecting" and "streaming". My own opinion is that it is a
> > non-starter to introduce more states that can be set by the startup
> > process, and that a new state should reflect what we do in the code.
> > We already have some of that for in the start and stop phases because
> > we want some ordering when the WAL receiver process is spawned and at
> > shutdown. That's just a simple way to say that we should not rely on
> > more static variables to control how to set one or more states, and I
> > don't see why that's actually required here? initialApplyPtr and
> > force_reply are what I see as potential recipes for more bugs in the
> > long term, as showed in the first approach. The second patch,
> > introducing a similar new complexity with walrcv_streaming_set, is no
> > better in terms of complexity added.
> > The main take that I can retrieve from this thread is that it may take
> > time between the moment we begin a WAL receiver in WalReceiverMain(),
> > where walRcvState is switched to WALRCV_STREAMING, and the moment we
> > actually have established a connection, location where "first_stream =
> > false" (which is just to track if a WAL receiver is restarting,
> > actually) after walrcv_startstreaming() has returned true, so as far
> > as I can see you would be happy enough with the addition of a single
> > state called CONNECTING, set at the beginning of WalReceiverMain()
> > instead of where STREAMING is set now. The same would sound kind of
> > true for WalRcvWaitForStartPosition(), because we are not actively
> > streaming yet, still we are marking the WAL receiver as streaming, so
> > the current code feels like we are cheating as if we define
> > "streaming" as a WAL receiver that has already done an active
> > connection. We also want the WAL receiver to be killable by the
> > startup process while in "connecting" or "streaming" more.
> >
> > Hence I would suggest something like the following guidelines:
> > - Add only a CONNECTING state. Set this state where we switch the
> > state to "streaming" now, aka the two locations in the tree now.
> > - Switch to STREAMING once the connection has been established, as
> > returned by walrcv_startstreaming(), because we are acknowledging *in
> > the code* that we have started streaming successfully.
> > - Update the docs to reflect the new state, because this state can
> > show up in the system view pg_stat_wal_receiver.
> > - I am not convinved by what we gain with a CONNECTED state, either.
> > Drop it.
> > - The fact that we'd want to switch the state once the startup process
> > has acknowleged the reception of the first byte from the stream is
> > already something we track in the WAL receiver, AFAIK.
>
> Thank you for the detailed feedback. I agree with your analysis — the
> simpler approach seems preferable and should be sufficient in most
> cases. Tightly coupling the startup process with the WAL receiver to
> set state is not very ideal. I'll post v5 with the simplified
> walreceiver changes as you suggested shortly.

Please see v5 of the updated patch.

--
Best,
Xuneng

Attachment Content-Type Size
v5-0001-Add-WALRCV_CONNECTING-state-to-walreceiver.patch application/x-patch 6.3 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Hannu Krosing 2026-01-21 13:05:46 Re: Patch: dumping tables data in multiple chunks in pg_dump
Previous Message jian he 2026-01-21 12:15:25 Re: Emitting JSON to file using COPY TO