Re: Condition variable live lock

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Condition variable live lock
Date: 2018-01-09 00:02:23
Message-ID: 20412.1515456143@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> writes:
> On Sun, Jan 7, 2018 at 10:00 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Another loose end that I'm seeing here is that while a process waiting on
>> a condition variable will respond to a cancel or die interrupt, it will
>> not notice postmaster death. This seems unwise to me. I think we should
>> adjust the WaitLatch call to include WL_POSTMASTER_DEATH as a wake
>> condition and just do a summary proc_exit(1) if it sees that. I'd even
>> argue that that is a back-patchable bug fix.

> Yeah. As far as I know so far, every place where we wait on a
> WaitEventSet falls into one of 4 categories when it comes to
> postmaster death:

> 1. We proc_exit(1) if WL_POSTMASTER_DEATH is returned.
> 2. We ereport(FATAL) if WL_POSTMASTER_DEATH is returned.
> 3. We asked for WL_POSTMASTER_DEATH in the WaitEventSet, but we don't
> actually bother checking for it in the returned value. Instead we
> call PostmasterIsAlive() every time through the loop (pgarch.c,
> syncrep.c, walsender.c and walreceiver.c).
> 4. We didn't ask for WL_POSTMASTER_DEATH.

> Do I have that right? I guess category 3 is suboptimal especially on
> some clunky but loveable kernels[1] and all cases of 4 including this
> one are probably bugs. That makes me wonder why we don't change the
> WaitEventSet API so that it calls proc_exit(1) for you by default if
> you didn't ask to receive WL_POSTMASTER_DEATH explicitly, to handle
> category 1 for free. If you asked for WL_POSTMASTER_DEATH explicitly
> then we could return it, to support category 2 callers that want to do
> something different. Just a thought.

Yeah, that's worth thinking about, because I'm pretty sure that we have
had this type of bug before. We have to be a bit careful because
Andres is thinking of using the WaitEventState support in the postmaster
loop, where it definitely shouldn't check WL_POSTMASTER_DEATH, and it
is probably also possible to reach some of those calls in standalone
backends, which shouldn't either. So I'm imagining:

1. The WaitEventSet code always includes WL_POSTMASTER_DEATH checking
if IsUnderPostmaster --- and conversely, if !IsUnderPostmaster, it should
ignore any caller request for that.

2. If caller specifies WL_POSTMASTER_DEATH then we'll return that
flag bit, otherwise just exit(1) --- or should the default be
ereport(FATAL)?

3. Remove explicit specification of WL_POSTMASTER_DEATH anywhere that
the default handling is OK, which is probably almost everywhere.
Get rid of those now-redundant PostmasterIsAlive checks, too.

I'm not real sure BTW why we have some callers that ereport and some
that just exit(1). Seems like it would be better to be consistent,
though I'm not entirely sure which behavior to standardize on.

(Of course, this would only be an appropriate thing to do in HEAD.
In v10 I think we should do the narrow fix I suggested up top.)

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2018-01-09 00:12:08 Re: Unimpressed with pg_attribute_always_inline
Previous Message Tom Lane 2018-01-08 23:50:33 Re: Condition variable live lock