Re: SIGQUIT handling, redux

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: SIGQUIT handling, redux
Date: 2020-09-10 16:56:55
Message-ID: 251830.1599757015@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Wed, Sep 9, 2020 at 10:07 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> bgworker_die (SIGTERM)
>> Calls ereport(FATAL). This is surely not any safer than, say,
>> quickdie(). No, it's worse, because at least that won't try
>> to go out via proc_exit().

> I think bgworker_die() is pretty much a terrible idea.

Yeah. It'd likely be better to insist that bgworkers handle SIGTERM
the same way backends do, via CHECK_FOR_INTERRUPTS. Not sure how big
a change that would be.

> I think that the only way this could actually
> be safe is if you have a background worker that never uses ereport()
> itself, so that the ereport() in the signal handler can't be
> interrupting one that's already happening.

That doesn't begin to make it safe, because quite aside from anything
that happens in elog.c, we're then going to go out via proc_exit()
which will invoke who knows what.

>> StandbyDeadLockHandler (from SIGALRM)
>> StandbyTimeoutHandler (ditto)
>> Calls CancelDBBackends, which just for starters tries to acquire
>> an LWLock. I think the only reason we've gotten away with this
>> for this long is the high probability that by the time either
>> timeout fires, we're going to be blocked on a semaphore.

> Yeah, I'm not sure these are so bad. In fact, in the deadlock case, I
> believe the old coding was designed to make sure we *had to* be
> blocked on a semaphore, but I'm not sure whether that's still true.

Looking at this closer, the only code that could get interrupted
by these timeouts is a call to ProcWaitForSignal, which is

void
ProcWaitForSignal(uint32 wait_event_info)
{
(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
wait_event_info);
ResetLatch(MyLatch);
CHECK_FOR_INTERRUPTS();
}

Interrupting the latch operations might be safe enough,
although now I'm casting a side-eye at Munro's recent
changes to share WaitEvent state all over the place.
I wonder whether blocking on an LWLock inside the
signal handler will break an interrupted WaitLatch.

Also, man that CHECK_FOR_INTERRUPTS() looks like trouble.
Could we take that out?

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jonathan S. Katz 2020-09-10 17:08:30 Re: PostgreSQL 13 Release Timeline
Previous Message Peter Eisentraut 2020-09-10 16:53:14 Re: Collation versioning