Re: Add WALRCV_CONNECTING state to walreceiver

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Xuneng Zhou <xunengzhou(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Add WALRCV_CONNECTING state to walreceiver
Date: 2026-01-19 00:13:12
Message-ID: aW13GJn_RfTJIFCa@paquier.xyz
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Jan 11, 2026 at 08:56:57PM +0800, Xuneng Zhou wrote:
> After some thoughts, I’m more inclined toward a startup-process–driven
> approach. Marking the status as streaming immediately after the
> connection is established seems not provide sufficient accuracy for
> monitoring purposes. Introducing an intermediate state, such as
> connected, would help reduce confusion when the startup process is
> stalled and would make it easier for users to detect and diagnose
> anomalies.
>
> V4 whitelisted CONNECTED and CONNECTING in WalRcvWaitForStartPosition
> to handle valid stream termination scenarios without triggering a
> FATAL error.
>
> Specifically, the walreceiver may need to transition to WAITING (idle) if:
> 1. 'CONNECTED': The handshake succeeded (COPY_BOTH started), but
> the stream ended before any WAL was applied (e.g., timeline divergence
> detected mid-stream).
> 2. 'CONNECTING': The handshake completed (START_REPLICATION
> acknowledged), but the primary declined to stream (e.g., no WAL
> available on the requested timeline).
>
> In both cases, the receiver should pause and await a new timeline or
> restart position from the startup process.

This stuff depends on the philosophical difference you want to put
behind "connecting" and "streaming". My own opinion is that it is a
non-starter to introduce more states that can be set by the startup
process, and that a new state should reflect what we do in the code.
We already have some of that for in the start and stop phases because
we want some ordering when the WAL receiver process is spawned and at
shutdown. That's just a simple way to say that we should not rely on
more static variables to control how to set one or more states, and I
don't see why that's actually required here? initialApplyPtr and
force_reply are what I see as potential recipes for more bugs in the
long term, as showed in the first approach. The second patch,
introducing a similar new complexity with walrcv_streaming_set, is no
better in terms of complexity added.

The main take that I can retrieve from this thread is that it may take
time between the moment we begin a WAL receiver in WalReceiverMain(),
where walRcvState is switched to WALRCV_STREAMING, and the moment we
actually have established a connection, location where "first_stream =
false" (which is just to track if a WAL receiver is restarting,
actually) after walrcv_startstreaming() has returned true, so as far
as I can see you would be happy enough with the addition of a single
state called CONNECTING, set at the beginning of WalReceiverMain()
instead of where STREAMING is set now. The same would sound kind of
true for WalRcvWaitForStartPosition(), because we are not actively
streaming yet, still we are marking the WAL receiver as streaming, so
the current code feels like we are cheating as if we define
"streaming" as a WAL receiver that has already done an active
connection. We also want the WAL receiver to be killable by the
startup process while in "connecting" or "streaming" more.

Hence I would suggest something like the following guidelines:
- Add only a CONNECTING state. Set this state where we switch the
state to "streaming" now, aka the two locations in the tree now.
- Switch to STREAMING once the connection has been established, as
returned by walrcv_startstreaming(), because we are acknowledging *in
the code* that we have started streaming successfully.
- Update the docs to reflect the new state, because this state can
show up in the system view pg_stat_wal_receiver.
- I am not convinved by what we gain with a CONNECTED state, either.
Drop it.
- The fact that we'd want to switch the state once the startup process
has acknowleged the reception of the first byte from the stream is
already something we track in the WAL receiver, AFAIK. I think that
there would be a point in expanding the SQL functions to report more
states of the startup process, including the data received by the
startup process, but we should not link that to the state of the WAL
receiver. An extra reason to not do that: WAL receivers are not the
only source feeding data to the startup process, we could have data
pushed to pg_wal/, or archive commands/modules doing this job.
--
Michael

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2026-01-19 00:25:59 Re: Add parallel columns for seq scan and index scan on pg_stat_all_tables and _indexes
Previous Message Peter Geoghegan 2026-01-18 23:51:58 Re: index prefetching