Re: Race conditions with checkpointer and shutdown

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Michael Paquier <michael(at)paquier(dot)xyz>
Cc:	Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Race conditions with checkpointer and shutdown
Date:	2019-04-28 00:56:51
Message-ID:	2766.1556413011@sss.pgh.pa.us
Lists:	pgsql-hackers

I have spent a fair amount of time trying to replicate these failures
locally, with little success. I now think that the most promising theory
is Munro's idea in [1] that the walreceiver is hanging up during its
unsafe attempt to do ereport(FATAL) from inside a signal handler. It's
extremely plausible that that could result in a deadlock inside libc's
malloc/free, or some similar place. Moreover, if that's what's causing
it, then the window for trouble is only as long as malloc might hold its
internal locks, which fits with my finding that inserting delays in
various promising-looking places does nothing to make this
reproducible.
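
To make the hazard concrete: a handler that calls anything which may
allocate or take libc locks can interrupt the same code in the main
flow and deadlock. The usual safe shape is for the handler to do nothing
but set a flag, and for the main flow to act on it at a point where it is
safe to log, free memory, and exit. The following is a minimal standalone
sketch of that shape, not PostgreSQL code; all names in it
(shutdown_requested, handle_sigterm, install_sigterm_handler,
run_until_shutdown) are hypothetical, and raise() is used only to make the
demonstration deterministic.

```c
#include <signal.h>

/* Hypothetical flag: the only state the signal handler touches. */
static volatile sig_atomic_t shutdown_requested = 0;

/* Async-signal-safe handler: no malloc, no stdio, no exit(). */
static void
handle_sigterm(int signo)
{
    (void) signo;
    shutdown_requested = 1;
}

/* Install the handler; returns 0 on success, -1 on failure. */
static int
install_sigterm_handler(void)
{
    return (signal(SIGTERM, handle_sigterm) == SIG_ERR) ? -1 : 0;
}

/*
 * Run up to max_steps units of "work", checking the flag at the top of
 * each iteration.  For demonstration only, SIGTERM is raised while
 * executing step number signal_at_step.  Returns the number of steps
 * completed before the shutdown request was noticed.
 */
static int
run_until_shutdown(int max_steps, int signal_at_step)
{
    int steps = 0;

    while (steps < max_steps)
    {
        /* Safe point: here it would be fine to log, free, and exit. */
        if (shutdown_requested)
            break;

        if (steps == signal_at_step)
            raise(SIGTERM);     /* stands in for an external kill */

        steps++;                /* the interrupted step still completes */
    }
    return steps;
}
```

With the handler installed, run_until_shutdown(10, 3) finishes the step
during which the signal arrived and returns 4; nothing unsafe ever runs
in signal context.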

Even if that isn't the proximate cause of the current reports, it's
clearly trouble waiting to happen, and we should get rid of it.
Accordingly, see attached proposed patch. This just removes the
"immediate interrupt" mechanism in favor of making sure that
libpqwalreceiver.c will take care of any signals received while
waiting for input.

The existing code does not use PQsetnonblocking, which means that it's
theoretically at risk of blocking while pushing out data to the remote
server. In practice I think that risk is negligible because (IIUC) we
don't send very large amounts of data at one time. So I didn't bother to
change that. Note that for the most part, if that happened, the existing
code was at risk of slow response to SIGTERM anyway since it didn't have
Enable/DisableWalRcvImmediateExit around the places that send data.
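
For a sense of why small sends are low-risk, the sketch below uses plain
POSIX sockets rather than libpq: it puts one end of a socket pair into
non-blocking mode and counts how many bytes the kernel accepts before a
write would block when the peer never reads. The helper names
(set_nonblocking, bytes_until_would_block) are hypothetical, and the
actual buffer size is platform-dependent. A blocking socket would stall
at the same point instead of returning EAGAIN.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Put fd into non-blocking mode; returns 0 on success, -1 on failure. */
static int
set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return (fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) ? -1 : 0;
}

/*
 * Send chunk_size-byte chunks (capped at 4096) on a non-blocking socket
 * whose peer never reads, until the kernel reports that it would block.
 * Returns the number of bytes accepted, or -1 on an unexpected error.
 */
static long
bytes_until_would_block(int fd, size_t chunk_size)
{
    char        chunk[4096];
    long        total = 0;

    if (chunk_size == 0 || chunk_size > sizeof(chunk))
        chunk_size = sizeof(chunk);
    memset(chunk, 'x', sizeof(chunk));

    for (;;)
    {
        ssize_t n = send(fd, chunk, chunk_size, 0);

        if (n > 0)
        {
            total += n;
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;           /* interrupted by a signal: retry */
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return total;       /* buffer full: a blocking send stalls here */
        return -1;
    }
}
```

Anything sent below the returned total goes straight into the kernel
buffer without waiting on the peer, which is the situation described
above for small messages.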

My thought is to apply this only to HEAD for now; it's kind of a large
change to shove into the back branches to handle a failure mode that's
not been reported from the field. Maybe we could back-patch after we
have more confidence in it.

regards, tom lane

[1] https://www.postgresql.org/message-id/CA%2BhUKG%2B%3D1G98m61VjNS-qGboJPwdZcF%2BrAPu2eC4XuWRTR3UPw%40mail.gmail.com

Attachment	Content-Type	Size
dont-try-to-exit-from-signal-handler-1.patch	text/x-diff	12.8 KB

In response to

Re: Race conditions with checkpointer and shutdown at 2019-04-19 04:02:48 from Tom Lane

Responses

Re: Race conditions with checkpointer and shutdown at 2019-04-29 04:52:37 from Thomas Munro
Re: Race conditions with checkpointer and shutdown at 2019-04-29 16:35:11 from Andres Freund
Re: Race conditions with checkpointer and shutdown at 2019-04-29 17:26:09 from Ashwin Agrawal
