| From: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
|---|---|
| To: | Andrey Silitskiy <a(dot)silitskiy(at)postgrespro(dot)ru> |
| Cc: | Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Greg Sabino Mullane <htamfids(at)gmail(dot)com>, Japin Li <japinli(at)hotmail(dot)com>, Ronan Dunklau <ronan(at)dunklau(dot)fr>, Vitaly Davydov <v(dot)davydov(at)postgrespro(dot)ru>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, "Takamichi Osumi (Fujitsu)" <osumi(dot)takamichi(at)fujitsu(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "sawada(dot)mshk(at)gmail(dot)com" <sawada(dot)mshk(at)gmail(dot)com>, "michael(at)paquier(dot)xyz" <michael(at)paquier(dot)xyz>, "peter(dot)eisentraut(at)enterprisedb(dot)com" <peter(dot)eisentraut(at)enterprisedb(dot)com>, "dilipbalaut(at)gmail(dot)com" <dilipbalaut(at)gmail(dot)com>, "andres(at)anarazel(dot)de" <andres(at)anarazel(dot)de>, "amit(dot)kapila16(at)gmail(dot)com" <amit(dot)kapila16(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com> |
| Subject: | Re: Exit walsender before confirming remote flush in logical replication |
| Date: | 2026-03-31 17:33:20 |
| Message-ID: | CAHGQGwELRshB7z4PdkON1AGXvFu88s4vbF61TX=Tn-2_c4_pYg@mail.gmail.com |
| Lists: | pgsql-hackers |
On Mon, Mar 30, 2026 at 12:14 PM Andrey Silitskiy
<a(dot)silitskiy(at)postgrespro(dot)ru> wrote:
>
> On Mar 29, 2026 Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> > One possible reason why the hang may happen is that
> > WalSndWaitForWal() is missing a WalSndCheckShutdownTimeout() call.
>
> On Mar 25, 2026 Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> > I tested wal_sender_shutdown_timeout under several configurations and
> > encountered a case where the primary shutdown got stuck, ...
>
> Thanks for your help in finding the issue!
>
> I reproduced the problem. In this configuration, it turned out that the
> walsender was not terminated by wal_sender_shutdown_timeout in
> WalSndWaitForWal(), but only when the physical slot was checked for the
> inactive flag, which caused shutdown to hang.
Regarding the issue I reported, Vitaly's analysis upthread seems correct to me.

If WalSndComputeSleeptime() is called before WalSndCheckShutdownTimeout(), then
shutdown_request_timestamp is still 0, so wal_sender_shutdown_timeout is not
taken into account even though shutdown has already been requested
(i.e., got_STOPPING || got_SIGUSR2 is true).

In that case, if wal_sender_timeout is large, the computed sleep time can also
be large, and the walsender may wait in WalSndWait() longer than intended.

To fix this, walsender should call WalSndCheckShutdownTimeout() first so that
shutdown_request_timestamp is set before the sleep time is computed. The v7
patch already does this, which looks good to me. The comments for
WalSndCheckShutdownTimeout() might need an update, though.
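
To make the ordering problem concrete, here is a toy model in C. It is not
the walsender code or the v7 patch: the function bodies, the timeout values,
and the simulate_iteration() helper are all assumptions chosen for
illustration, and only the variable and GUC names are taken from this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: names mirror the thread, bodies are assumptions. */
typedef int64_t msec_t;                    /* all times in milliseconds */

static bool   shutdown_requested;          /* models got_STOPPING || got_SIGUSR2 */
static msec_t shutdown_request_timestamp;  /* 0 means "not recorded yet" */
static msec_t last_reply_timestamp;
static const msec_t wal_sender_timeout = 60000;          /* large on purpose */
static const msec_t wal_sender_shutdown_timeout = 5000;

/* Stand-in for WalSndCheckShutdownTimeout(): record when shutdown was first seen. */
static void check_shutdown_timeout(msec_t now)
{
    if (shutdown_requested && shutdown_request_timestamp == 0)
        shutdown_request_timestamp = now;
}

/* Stand-in for WalSndComputeSleeptime(): sleep until the nearest known deadline. */
static msec_t compute_sleeptime(msec_t now)
{
    msec_t wakeup = last_reply_timestamp + wal_sender_timeout;

    /* The shutdown deadline is only visible once the timestamp is set. */
    if (shutdown_request_timestamp != 0)
    {
        msec_t shutdown_deadline =
            shutdown_request_timestamp + wal_sender_shutdown_timeout;

        if (shutdown_deadline < wakeup)
            wakeup = shutdown_deadline;
    }
    return wakeup > now ? wakeup - now : 0;
}

/* Hypothetical helper: one loop iteration at time 1000, shutdown already requested. */
msec_t simulate_iteration(bool check_first)
{
    const msec_t now = 1000;
    msec_t sleeptime;

    shutdown_requested = true;
    shutdown_request_timestamp = 0;
    last_reply_timestamp = now;

    if (check_first)
        check_shutdown_timeout(now);
    sleeptime = compute_sleeptime(now);
    if (!check_first)
        check_shutdown_timeout(now);
    return sleeptime;
}
```

In this model, simulate_iteration(false) computes a sleep of 60000 ms because
the shutdown deadline is not yet known, while simulate_iteration(true)
computes 5000 ms.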
Regards,
--
Fujii Masao