Re: Add WALRCV_CONNECTING state to walreceiver

From: Noah Misch <noah(at)leadboat(dot)com>
To: Xuneng Zhou <xunengzhou(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>
Subject: Re: Add WALRCV_CONNECTING state to walreceiver
Date: 2025-12-15 04:14:05
Message-ID: 20251215041405.4c.nmisch@google.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Dec 14, 2025 at 06:17:34PM +0800, Xuneng Zhou wrote:
> On Sun, Dec 14, 2025 at 4:55 PM Xuneng Zhou <xunengzhou(at)gmail(dot)com> wrote:
> > On Sun, Dec 14, 2025 at 1:14 PM Noah Misch <noah(at)leadboat(dot)com> wrote:
> > > > V2 makes the transition from WALRCV_CONNECTING to STREAMING only when
> > > > the first valid WAL record is processed by the startup process. A new
> > > > function WalRcvSetStreaming is introduced to enable the transition.
> > >
> > > The original patch set STREAMING in XLogWalRcvFlush(). XLogWalRcvFlush()
> > > callee XLogWalRcvSendReply() already fetches applyPtr to send a status
> > > message. So I would try the following before involving the startup process
> > > like v2 does:
> > >
> > > 1. store the applyPtr when we enter CONNECTING
> > > 2. force a status message as long as we remain in CONNECTING
> > > 3. become STREAMING when applyPtr differs from the one stored at (1)
> >
> > Thanks for the suggestion. Using XLogWalRcvSendReply() for the
> > transition could make sense. My concern before is about latency in a
> > rare case: if the first flush completes but applyPtr hasn't advanced
> > yet at the time of check and then the flush stalls after that, we
> > might wait up to wal_receiver_status_interval (default 10s) before the
> > next check or indefinitely if (wal_receiver_status_interval <= 0).
> > This could be mitigated by shortening the wakeup interval while in
> > CONNECTING (step 2), which reduces worst-case latency to ~1 second.
> > Given that monitoring typically doesn't require sub-second precision,
> > this approach could be feasible.
> >
> > case WALRCV_WAKEUP_REPLY:
> > if (WalRcv->walRcvState == WALRCV_CONNECTING)
> > {
> > /* Poll frequently while CONNECTING to avoid long latency */
> > wakeup[reason] = TimestampTzPlusMilliseconds(now, 1000);
> > }
> >
> > > A possible issue with all patch versions: when the primary is writing no WAL
> > > and the standby was caught up before this walreceiver started, CONNECTING
> > > could persist for an unbounded amount of time. Only actual primary WAL
> > > generation would move the walreceiver to STREAMING. This relates to your
> > > above point about high latency. If that's a concern, perhaps this change
> > > deserves a total of two new states, CONNECTING and a state that represents
> > > "connection exists, no WAL yet applied"?
> >
> > Yes, this could be an issue. Using two states would help address it.
> > That said, when the primary is idle in this case, we might end up
> > repeatedly polling the apply status in the state before streaming if
> > we implement the 1s short-interval checking like above, which could be
> > costful. However, If we do not implement it &&
> > wal_receiver_status_interval is set to < 0 && flush stalls, the
> > walreceiver could stay in the pre-streaming state indefinitely even if
> > streaming did occur, which violates the semantics. Do you think this
> > is a valid concern or just an artificial edge case?
>
> After looking more closely, I found that true indefinite waiting
> requires ALL of:
>
> wal_receiver_status_interval <= 0 (disables status updates)
> wal_receiver_timeout <= 0
> Primary sends no keepalives
> No more WAL arrives after the first failed-check flush
> Startup never sets force_reply
>
> which is quite impossible and artificial, sorry for the noise here.

Even if indefinite wait is a negligible concern, you identified a lot of
intricacy that I hadn't pictured. That makes your startup-process-driven
version potentially more attractive. Forcing status messages like I was
thinking may also yield an unwanted flurry of them if the startup process is
slow. Let's see what the patch reviewer thinks.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2025-12-15 04:18:23 Re: Fixes a typo in tablecmd
Previous Message Michael Paquier 2025-12-15 04:08:44 Re: Fix memory leak in gist_page_items() of pageinspect