| From: | Xuneng Zhou <xunengzhou(at)gmail(dot)com> |
|---|---|
| To: | Michael Paquier <michael(at)paquier(dot)xyz> |
| Cc: | Rahila Syed <rahilasyed90(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Add WALRCV_CONNECTING state to walreceiver |
| Date: | 2026-01-19 15:35:28 |
| Message-ID: | CABPTF7V3qZy76ryPUsqGwM4vOZ_f-HUap8qKom82FY8gY7K=DA@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi Michael,
On Mon, Jan 19, 2026 at 8:13 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>
> On Sun, Jan 11, 2026 at 08:56:57PM +0800, Xuneng Zhou wrote:
> > After some thoughts, I’m more inclined toward a startup-process–driven
> > approach. Marking the status as streaming immediately after the
> > connection is established seems not provide sufficient accuracy for
> > monitoring purposes. Introducing an intermediate state, such as
> > connected, would help reduce confusion when the startup process is
> > stalled and would make it easier for users to detect and diagnose
> > anomalies.
> >
> > V4 whitelisted CONNECTED and CONNECTING in WalRcvWaitForStartPosition
> > to handle valid stream termination scenarios without triggering a
> > FATAL error.
> >
> > Specifically, the walreceiver may need to transition to WAITING (idle) if:
> > 1. 'CONNECTED': The handshake succeeded (COPY_BOTH started), but
> > the stream ended before any WAL was applied (e.g., timeline divergence
> > detected mid-stream).
> > 2. 'CONNECTING': The handshake completed (START_REPLICATION
> > acknowledged), but the primary declined to stream (e.g., no WAL
> > available on the requested timeline).
> >
> > In both cases, the receiver should pause and await a new timeline or
> > restart position from the startup process.
>
> This stuff depends on the philosophical difference you want to put
> behind "connecting" and "streaming". My own opinion is that it is a
> non-starter to introduce more states that can be set by the startup
> process, and that a new state should reflect what we do in the code.
> We already have some of that for in the start and stop phases because
> we want some ordering when the WAL receiver process is spawned and at
> shutdown. That's just a simple way to say that we should not rely on
> more static variables to control how to set one or more states, and I
> don't see why that's actually required here? initialApplyPtr and
> force_reply are what I see as potential recipes for more bugs in the
> long term, as showed in the first approach. The second patch,
> introducing a similar new complexity with walrcv_streaming_set, is no
> better in terms of complexity added.
> The main take that I can retrieve from this thread is that it may take
> time between the moment we begin a WAL receiver in WalReceiverMain(),
> where walRcvState is switched to WALRCV_STREAMING, and the moment we
> actually have established a connection, location where "first_stream =
> false" (which is just to track if a WAL receiver is restarting,
> actually) after walrcv_startstreaming() has returned true, so as far
> as I can see you would be happy enough with the addition of a single
> state called CONNECTING, set at the beginning of WalReceiverMain()
> instead of where STREAMING is set now. The same would sound kind of
> true for WalRcvWaitForStartPosition(), because we are not actively
> streaming yet, still we are marking the WAL receiver as streaming, so
> the current code feels like we are cheating as if we define
> "streaming" as a WAL receiver that has already done an active
> connection. We also want the WAL receiver to be killable by the
> startup process while in "connecting" or "streaming" more.
>
> Hence I would suggest something like the following guidelines:
> - Add only a CONNECTING state. Set this state where we switch the
> state to "streaming" now, aka the two locations in the tree now.
> - Switch to STREAMING once the connection has been established, as
> returned by walrcv_startstreaming(), because we are acknowledging *in
> the code* that we have started streaming successfully.
> - Update the docs to reflect the new state, because this state can
> show up in the system view pg_stat_wal_receiver.
> - I am not convinved by what we gain with a CONNECTED state, either.
> Drop it.
> - The fact that we'd want to switch the state once the startup process
> has acknowleged the reception of the first byte from the stream is
> already something we track in the WAL receiver, AFAIK.
Thank you for the detailed feedback. I agree with your analysis — the
simpler approach seems preferable and should be sufficient in most
cases. Tightly coupling the startup process with the WAL receiver to
set state is not very ideal. I'll post v5 with the simplified
walreceiver changes as you suggested shortly.
> I think that
> there would be a point in expanding the SQL functions to report more
> states of the startup process, including the data received by the
> startup process, but we should not link that to the state of the WAL
> receiver. An extra reason to not do that: WAL receivers are not the
> only source feeding data to the startup process, we could have data
> pushed to pg_wal/, or archive commands/modules doing this job.
+1. I'll prepare a separate patch to expose startup process metrics
like pg_stat_get_wal_receiver does. This would complement
pg_stat_wal_receiver without coupling the two subsystems.
--
Best,
Xuneng
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Peter Eisentraut | 2026-01-19 15:50:01 | Re: tablecmds: clarify recurse vs recusing |
| Previous Message | jian he | 2026-01-19 15:06:57 | Re: CREATE SCHEMA ... CREATE DOMAIN support |