walsender bug: stuck during shutdown

From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Pg Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Chloe Dives <Chloe(dot)Dives(at)cantabcapital(dot)com>, Chris Wilson <chris(dot)wilson(at)cantabcapital(dot)com>, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Subject: walsender bug: stuck during shutdown
Date: 2020-11-23 20:52:53
Message-ID: 20201123205253.GA10075@alvherre.pgsql
Lists: pgsql-hackers

Hello

Chloe Dives reported that sometimes a walsender would become stuck
during shutdown and *not* shut down, thus preventing postmaster from
completing the shutdown cycle. This has been observed to cause the
servers to remain in that state for several hours.

After a lengthy investigation and thanks to a handy reproducer by Chris
Wilson, we found that the problem is that WalSndDone wants to avoid
shutting down until everything has been sent and acknowledged; but this
test is coded in a way that ignores the possibility that we have never
received anything from the other end. In that case, both
MyWalSnd->flush and MyWalSnd->write are InvalidXLogRecPtr, so the
condition in WalSndDone to terminate the loop is never fulfilled. The
walsender therefore loops forever and never terminates, blocking
shutdown of the whole instance.
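
To make the failure mode concrete, here is a minimal standalone model of
the exit test. It is only a sketch, not the actual walsender.c code and
not the attached patch: the struct and the two function names are
hypothetical, the type and macro definitions are stand-ins for the real
PostgreSQL ones, and it assumes the test compares the sent position
against the receiver's flush position, falling back to its write
position when flush is invalid. The guard in the second function is just
one possible way of testing for the problematic condition.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the PostgreSQL definitions, so the sketch compiles alone. */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)
#define XLogRecPtrIsInvalid(r) ((r) == InvalidXLogRecPtr)

/* Hypothetical snapshot of the state the exit test looks at. */
typedef struct WalSndState
{
    XLogRecPtr sent;         /* how far WAL has been sent */
    XLogRecPtr write;        /* last write position reported by the receiver */
    XLogRecPtr flush;        /* last flush position reported by the receiver */
    bool       caught_up;    /* no more WAL left to send */
    bool       send_pending; /* output still queued on the socket */
} WalSndState;

/* Assumed shape of the exit test: no handling of "never heard back". */
bool
done_without_guard(const WalSndState *s)
{
    XLogRecPtr replicated =
        XLogRecPtrIsInvalid(s->flush) ? s->write : s->flush;

    return s->caught_up && s->sent == replicated && !s->send_pending;
}

/* One possible guard (an assumption): skip the comparison when the
 * receiver has never reported any position at all. */
bool
done_with_guard(const WalSndState *s)
{
    bool never_heard =
        XLogRecPtrIsInvalid(s->write) && XLogRecPtrIsInvalid(s->flush);
    XLogRecPtr replicated =
        XLogRecPtrIsInvalid(s->flush) ? s->write : s->flush;

    return s->caught_up && !s->send_pending &&
        (never_heard || s->sent == replicated);
}
```

With sent at some valid position and both write and flush invalid, the
first function can never return true no matter how often it is called,
which is the stuck state described above; the second one returns true as
soon as we are caught up and nothing is pending.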

The attached patch fixes the problem by testing for the problematic
condition.

Apparently this problem has existed forever. Fujii-san almost fixed
it in 5c6d9fc4b2b8 (2014!), but missed it by a zillionth of an inch.

--
Álvaro Herrera

Attachment: 0001-Don-t-loop-forever-in-WalSndDone.patch (text/x-diff, 1.1 KB)
