Re: Exit walsender before confirming remote flush in logical replication

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Andrey Silitskiy <a(dot)silitskiy(at)postgrespro(dot)ru>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Greg Sabino Mullane <htamfids(at)gmail(dot)com>, Japin Li <japinli(at)hotmail(dot)com>, Ronan Dunklau <ronan(at)dunklau(dot)fr>, Vitaly Davydov <v(dot)davydov(at)postgrespro(dot)ru>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, "Takamichi Osumi (Fujitsu)" <osumi(dot)takamichi(at)fujitsu(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "sawada(dot)mshk(at)gmail(dot)com" <sawada(dot)mshk(at)gmail(dot)com>, "michael(at)paquier(dot)xyz" <michael(at)paquier(dot)xyz>, "peter(dot)eisentraut(at)enterprisedb(dot)com" <peter(dot)eisentraut(at)enterprisedb(dot)com>, "dilipbalaut(at)gmail(dot)com" <dilipbalaut(at)gmail(dot)com>, "amit(dot)kapila16(at)gmail(dot)com" <amit(dot)kapila16(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>
Subject: Re: Exit walsender before confirming remote flush in logical replication
Date: 2026-04-08 08:11:50
Message-ID: CAHGQGwHoS7SPNS6r9Rw3Wq-_1Bnyq4raLemOCcm=sS=+CLnifw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Apr 8, 2026 at 4:05 PM Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> wrote:
> I have some CF entries that failed on this test case as well, so I tried to look into the problem.

Thanks for working on this, much appreciated!

> Once it enters WalSndDone(), it might call pq_flush() and get stuck:
> ```
> if (WalSndCaughtUp && sentPtr == replicatedPtr &&
>     !pq_is_send_pending())
> {
>     QueryCompletion qc;
>
>     /* Inform the standby that XLOG streaming is done */
>     SetQueryCompletion(&qc, CMDTAG_COPY, 0);
>     EndCommand(&qc, DestRemote, false);
>     pq_flush();
>
>     proc_exit(0);
> }
> ```
>
> And once stuck, it will never get back to WalSndCheckShutdownTimeout(), so the new GUC timeout cannot rescue it.

pq_flush() is called when WalSndCaughtUp && sentPtr == replicatedPtr
&& !pq_is_send_pending().
Under these conditions, I had assumed that the kernel send buffer isn't
full, so pq_flush() (i.e., the send() call) can copy the data without
blocking and return immediately.

I'm not very familiar with FreeBSD, but based on [1], I wonder whether this
assumption fails to hold there, in which case pq_flush() could still block.
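
To make the concern a bit more concrete, here is a minimal standalone sketch
(not PostgreSQL code; the helper name is hypothetical). It writes to one end
of a Unix-domain socket pair whose peer never reads, and counts how many
bytes send() accepts before it reports EAGAIN. The socket is non-blocking
only so that the program can observe the point where a blocking send() would
start to wait. It does not prove anything about the walsender case; it only
shows that whether send() returns immediately depends on the kernel-side
buffers and on the peer reading, not on the sender having nothing pending in
its own user-space buffer.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper for illustration: returns the number of bytes that
 * send() accepted on a Unix-domain stream socket before it would block,
 * given a peer that never reads.  Returns -1 on an unexpected error.
 */
static long
bytes_accepted_before_would_block(void)
{
    int     sv[2];
    char    chunk[4096] = {0};
    long    total = 0;
    int     flags;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    /* Non-blocking: a send() that would block fails with EAGAIN instead. */
    flags = fcntl(sv[0], F_GETFL, 0);
    if (flags < 0 || fcntl(sv[0], F_SETFL, flags | O_NONBLOCK) != 0)
    {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    for (;;)
    {
        ssize_t n = send(sv[0], chunk, sizeof(chunk), 0);

        if (n > 0)
            total += n;
        else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            break;              /* a blocking socket would wait here */
        else if (n < 0 && errno == EINTR)
            continue;           /* interrupted by a signal, retry */
        else
        {
            total = -1;         /* unexpected error */
            break;
        }
    }

    close(sv[0]);
    close(sv[1]);
    return total;
}
```

The returned byte count is platform-dependent, so comparing it between Linux
and FreeBSD might be a quick way to see how much room the peer's receive
buffer actually leaves for the final message.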

Regards,

[1] https://man.freebsd.org/cgi/man.cgi?unix(4)#BUFFERING

> Due to the local nature of the Unix-domain sockets, they do not implement
> send buffers. The send(2) and write(2) families of system calls attempt to
> write data to the receive buffer of the destination socket.

--
Fujii Masao
