Re: Exit walsender before confirming remote flush in logical replication

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	Andrey Silitskiy <a(dot)silitskiy(at)postgrespro(dot)ru>
Cc:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Greg Sabino Mullane <htamfids(at)gmail(dot)com>, Japin Li <japinli(at)hotmail(dot)com>, Ronan Dunklau <ronan(at)dunklau(dot)fr>, Vitaly Davydov <v(dot)davydov(at)postgrespro(dot)ru>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, "Takamichi Osumi (Fujitsu)" <osumi(dot)takamichi(at)fujitsu(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "sawada(dot)mshk(at)gmail(dot)com" <sawada(dot)mshk(at)gmail(dot)com>, "michael(at)paquier(dot)xyz" <michael(at)paquier(dot)xyz>, "peter(dot)eisentraut(at)enterprisedb(dot)com" <peter(dot)eisentraut(at)enterprisedb(dot)com>, "dilipbalaut(at)gmail(dot)com" <dilipbalaut(at)gmail(dot)com>, "andres(at)anarazel(dot)de" <andres(at)anarazel(dot)de>, "amit(dot)kapila16(at)gmail(dot)com" <amit(dot)kapila16(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>
Subject:	Re: Exit walsender before confirming remote flush in logical replication
Date:	2026-03-31 17:34:59
Message-ID:	CAHGQGwE-vmNX_DGo5k4YiCzAwLvT0qijnSK9P1jwXm5rUS-sKw@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Mar 31, 2026 at 2:31 AM Andrey Silitskiy
<a(dot)silitskiy(at)postgrespro(dot)ru> wrote:
>
> On Mar 30, 2026 Andrey Silitskiy
> <a(dot)silitskiy(at)postgrespro(dot)ru> wrote:
> > It is worth noting that in this configuration, the second walsender may
> > terminate due to inactive slot before the wal_sender_shutdown_timeout
> > terminates the process, ...
>
> Updated. In this configuration, in some situations walsender processes
> may already have been terminated by the time of fast shutdown.
> And in this case, the fast shutdown will be successful, but it will not
> be the reason for walsender termination and the logs will be different.
> Decided to leave the new shutdown test case, but not check the logs in it.

Thanks for updating the patch!

+ <varlistentry id="guc-wal_sender_shutdown_timeout"
xreflabel="wal_sender_shutdown_timeout">

Placing wal_sender_shutdown_timeout next to wal_sender_timeout would improve
readability, since they are closely related. Also, the "id" should probably use
hyphens (i.e., guc-wal-sender-shutdown-timeout) rather than underscores.

This parameter can be set individually
+ for each walsender.

This sentence doesn't seem necessary, as similar GUCs don't mention this.

+ If disabled, the walsender will wait for all WAL data to be
+ successfully flushed on the receiver side before exiting the process.

This description is a bit misleading, since walsender basically waits for WAL
replication regardless. It would be clearer to describe the behavior in terms
of waiting for replication to complete, and how the timeout affects that.
For example,

------------------------------------
Specifies the maximum time the server waits during shutdown for all
WAL data to be replicated to the receiver. If this value is specified
without units, it is taken as milliseconds. A value of -1 (the
default) disables the timeout mechanism.

When replication is in use, the sending server normally waits until
all WAL data has been transferred to the receiver before completing
shutdown. This helps keep sender and receiver in sync after shutdown,
which is especially important for physical replication switchovers,
but it can delay shutdown.

If this parameter is set, the server stops waiting and completes
shutdown when the timeout expires. This can shorten shutdown time, for
example, when replication is slow on high-latency networks or when a
logical replication apply worker is blocked waiting for locks.
However, in this case the sender and receiver may be out of sync after
shutdown.
------------------------------------

+ Users will stop waiting if a fast shutdown is requested. However, if
+ <varname>wal_sender_shutdown_timeout</varname> is not set, the server will
+ not fully shutdown until all outstanding WAL records are transferred to
+ the currently connected standby servers. This waiting applies to both
+ asynchronous and synchronous replication.

Similarly, the paragraph about fast shutdown is slightly misleading for the
same reason as above. How about updating this to the following, for example?

------------------------------------
Users will stop waiting if a fast shutdown is requested. However, when
using replication, the server will not fully shutdown until all
outstanding WAL records are transferred to the currently connected
standby servers, or wal_sender_shutdown_timeout (if set) expires,
regardless of whether replication is synchronous or asynchronous.
------------------------------------

+ <varname>wal_sender_shutdown_timeout</varname> is not set, the server will

Instead of referring to the parameter name directly, it would be better to use
<xref linkend="guc-wal-sender-shutdown-timeout"/> so readers can easily jump
to its description.

+ if (shutdown_request_timestamp == 0)
+ {
+ shutdown_request_timestamp = now;

It might be safer to set shutdown_request_timestamp even when
wal_sender_shutdown_timeout = -1, in case the value is changed during shutdown.
Currently, wal_sender_shutdown_timeout cannot normally be changed during
shutdown, since the postmaster does almost nothing on SIGHUP. However,
sending SIGHUP directly to the walsender can still change this parameter even
during shutdown.
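
A minimal standalone sketch of the ordering suggested above, not the patch's
actual code: record the shutdown request time unconditionally, then check the
timeout setting separately, so that a value changed from -1 by a reload during
shutdown is still measured from the original request. The struct, the function
name, and the plain millisecond timestamps are hypothetical stand-ins for the
walsender's real state and TimestampTz handling.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for walsender shutdown state; not PostgreSQL code. */
typedef struct ShutdownState
{
    int64_t shutdown_request_timestamp;  /* ms; 0 means "not recorded yet" */
    int     wal_sender_shutdown_timeout; /* ms; -1 disables the timeout */
} ShutdownState;

/*
 * Called on each loop iteration once shutdown has been requested.
 * Returns true if the walsender should stop waiting and exit.
 */
static bool
shutdown_timeout_expired(ShutdownState *st, int64_t now)
{
    /* Record the request time even when the timeout is disabled. */
    if (st->shutdown_request_timestamp == 0)
        st->shutdown_request_timestamp = now;

    /* Evaluate the current setting, which may have changed via reload. */
    if (st->wal_sender_shutdown_timeout < 0)
        return false;

    return now - st->shutdown_request_timestamp >=
        st->wal_sender_shutdown_timeout;
}
```

With this ordering, enabling the timeout partway through shutdown counts the
elapsed time from the first shutdown request rather than from the moment the
setting changed.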

+ ereport(WARNING,
+ (errmsg("terminating walsender due to wal_sender_shutdown_timeout
expiration, replication may be incomplete")));

"expiration" seems unnecessary, given other similar log messages like
"terminating walsender process due to replication timeout". So something like
"terminating walsender process due to wal_sender_shutdown_timeout" should be
enough.

It would also be better to move "replication may be incomplete" to errdetail(),
and clarify it, for example, "Walsender is terminated before all WAL data was
replicated to the receiver". Thoughts?

Regards,

--
Fujii Masao

In response to

Re: Exit walsender before confirming remote flush in logical replication at 2026-03-30 17:31:04 from Andrey Silitskiy
