Re: Race conditions with checkpointer and shutdown

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Race conditions with checkpointer and shutdown
Date: 2019-04-28 00:56:51
Message-ID: 2766.1556413011@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I have spent a fair amount of time trying to replicate these failures
locally, with little success. I now think that the most promising theory
is Munro's idea in [1] that the walreceiver is hanging up during its
unsafe attempt to do ereport(FATAL) from inside a signal handler. It's
extremely plausible that that could result in a deadlock inside libc's
malloc/free, or some similar place. Moreover, if that's what's causing
it, then the windows for trouble are fixed by the length of time that
malloc might hold internal locks, which fits with the results I've gotten
that inserting delays in various promising-looking places doesn't do a
thing towards making this reproducible.

Even if that isn't the proximate cause of the current reports, it's
clearly trouble waiting to happen, and we should get rid of it.
Accordingly, see attached proposed patch. This just flushes the
"immediate interrupt" stuff in favor of making sure that
libpqwalreceiver.c will take care of any signals received while
waiting for input.

The existing code does not use PQsetnonblocking, which means that it's
theoretically at risk of blocking while pushing out data to the remote
server. In practice I think that risk is negligible because (IIUC) we
don't send very large amounts of data at one time. So I didn't bother to
change that. Note that for the most part, if that happened, the existing
code was at risk of slow response to SIGTERM anyway since it didn't have
Enable/DisableWalRcvImmediateExit around the places that send data.

My thought is to apply this only to HEAD for now; it's kind of a large
change to shove into the back branches to handle a failure mode that's
not been reported from the field. Maybe we could back-patch after we
have more confidence in it.

regards, tom lane

[1] https://www.postgresql.org/message-id/CA%2BhUKG%2B%3D1G98m61VjNS-qGboJPwdZcF%2BrAPu2eC4XuWRTR3UPw%40mail.gmail.com

Attachment Content-Type Size
dont-try-to-exit-from-signal-handler-1.patch text/x-diff 12.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2019-04-28 01:36:22 Re: Improve search for missing parent downlinks in amcheck
Previous Message Peter Geoghegan 2019-04-28 00:36:56 Re: Improve search for missing parent downlinks in amcheck