Re: bgworker crashed or not?

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: Craig Ringer <craig(at)2ndquadrant(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Antonin Houska <antonin(dot)houska(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>
Subject: Re: bgworker crashed or not?
Date: 2014-05-06 17:05:22
Message-ID: CA+TgmoZyW_xVDyew0pNVTe43VbNhPMp2kGi+kcw+f7fM-GXy7w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, May 4, 2014 at 9:57 AM, Petr Jelinek <petr(at)2ndquadrant(dot)com> wrote:
> On 30/04/14 23:35, Robert Haas wrote:
>> On Mon, Apr 28, 2014 at 4:19 PM, Petr Jelinek <petr(at)2ndquadrant(dot)com>
>> wrote:
>>> On 28/04/14 16:27, Robert Haas wrote:
>>>> It seems we have consensus on what to do about this, but what we
>>>> haven't got is a patch.
>>>
>>>
>>> If you mean the consensus that exit status 0 should mean permanent stop
>>> then
>>> I think the patch can be as simple as attached.
>>
>> Hmm. Well, at the very least, you need to update the comment.
>
> Obviously, also docs, the above was just demonstration that the patch is
> simple (and also all I could manage to write and test on the way from
> airport to hotel).
>
> The attached patch is more proper.

Hmm, I see some problems here. The current comment for
bgworker_quickdie() says that we do exit(0) there rather than exit(2)
to avoid having the postmaster delay restart based on
bgw_restart_time, but with the proposed change in the meaning of
exit(2), that would have the effect of unregistering the worker, which
I guess is why you've changed it to call exit(2) instead, but then the
bgw_restart_delay applies, which I think we do not want. To make
matters more confusing, the existing comment is actually only correct
for non-shmem-connected workers - shmem-connected workers will treat
an exit status of anything other than 0 or 1 in exactly the same
fashion as a failure to disengage the deadman switch.

Which brings up another point: the behavior of non-shmem-connected
workers is totally bizarre. An exit status other than 0 or 1 is not
treated as a crash requiring a restart, but failure to disengage the
deadman switch is still treated as a crash requiring a restart. Why?
If the workers are not shmem-connected, then no crash requires a
system-wide restart. Of course, there's the tiny problem that we
aren't actually unmapping shared memory from supposedly non-shmem
connected workers, which is a different bug, but ignoring that for the
moment there's no reason for this logic to be like this.

What I'm inclined to do is change the logic so that:

(1) After a crash-and-restart sequence, zero rw->rw_crashed_at, so
that anything which is still registered gets restarted immediately.

(2) If a shmem-connected backend fails to release the deadman switch
or exits with an exit code other than 0 or 1, we crash-and-restart. A
non-shmem-connected backend never causes a crash-and-restart.

(3) When a background worker exits without triggering a
crash-and-restart, an exit code of precisely 0 causes the worker to be
unregistered; any other exit code has no special effect, so
bgw_restart_time controls.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Josh Berkus 2014-05-06 17:08:01 Re: proposal: Set effective_cache_size to greater of .conf value, shared_buffers
Previous Message Andres Freund 2014-05-06 16:55:05 Re: proposal: Set effective_cache_size to greater of .conf value, shared_buffers