Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: MauMau <maumau307(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgreSQL(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
Date: 2013-06-24 19:29:45
Message-ID: 20130624192945.GE4051@eldon.alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

MauMau escribió:
> From: "Alvaro Herrera" <alvherre(at)2ndquadrant(dot)com>

> >Actually, in further testing I noticed that the fast-path you introduced
> >in BackendCleanup (or was it HandleChildCrash?) in the immediate
> >shutdown case caused postmaster to fail to clean up properly after
> >sending the SIGKILL signal, so I had to remove that as well. Was there
> >any reason for that?
>
> You are talking about the code below at the beginning of
> HandleChildCrash(), aren't you?
>
> /* Do nothing if the child terminated due to immediate shutdown */
> if (Shutdown == ImmediateShutdown)
> return;
>
> If my memory is correct, this was necessary; without this,
> HandleChildCrash() or LogChildExit() left extra messages in the
> server log.
>
> LOG: %s (PID %d) exited with exit code %d
> LOG: terminating any other active server processes
>
> These messages are output because postmaster sends SIGQUIT to its
> children and wait for them to terminate. The children exit with
> non-zero status when they receive SIGQUIT.

Yeah, I see that --- after removing that early exit, there are unwanted
messages. And in fact there are some signals sent that weren't
previously sent. Clearly we need something here: if we're in immediate
shutdown handler, don't signal anyone (because they have already been
signalled) and don't log any more messages; but the cleaning up of
postmaster's process list must still be carried out.

Would you please add that on top of the attached cleaned up version of
your patch?

Noah Misch escribió:
> On Sun, Jun 23, 2013 at 01:55:19PM +0900, MauMau wrote:

> > the clients at the immediate shutdown. The code gets much simpler. In
> > addition, it may be better that we similarly send SIGKILL in backend
> > crash (FatalError) case, eliminate the use of SIGQUIT and remove
> > quickdie() and other SIGQUIT handlers.
>
> My take is that the client message has some value, and we shouldn't just
> discard it to simplify the code slightly. Finishing the shutdown quickly has
> value, of course. The relative importance of those things should guide the
> choice of a timeout under method #2, but I don't see a rigorous way to draw
> the line. I feel five seconds is, if anything, too high.

There's obviously a lot of disagreement on 5 seconds being too high or
too low. We have just followed SysV init's precedent of waiting 5
seconds by default between sending signals TERM and QUIT during a
shutdown. I will note that during a normal shutdown, services are
entitled to do much more work than just signal all their children to
exit immediately; and yet I don't find much evidence that this period is
inordinately short. I don't feel strongly that it couldn't be shorter,
though.

> What about using deadlock_timeout? It constitutes a rough barometer on the
> system's tolerance for failure detection latency, and we already overload it
> by having it guide log_lock_waits. The default of 1s makes sense to me for
> this purpose, too. We can always add a separate immediate_shutdown_timeout if
> there's demand, but I don't expect such demand. (If we did add such a GUC,
> setting it to zero could be the way to request method 1. If some folks
> strongly desire method 1, that's probably the way to offer it.)

I dunno. Having this be configurable seems overkill to me. But perhaps
that's the way to satisfy most people: some people could set it very
high so that they could have postmaster wait longer if they believe
their server is going to be overloaded; people wishing immediate SIGKILL
could get that too, as you describe.

I think this should be a separate patch, however.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
reliable_immediate_shutdown-2.patch text/x-diff 8.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2013-06-24 19:39:26 Re: dump difference between 9.3 and master after upgrade
Previous Message Josh Berkus 2013-06-24 19:24:29 PLR does not install language templates