Re: checkpointer code behaving strangely on postmaster -T

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: checkpointer code behaving strangely on postmaster -T
Date: 2012-05-11 21:44:50
Message-ID: 17504.1336772690@sss.pgh.pa.us
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Excerpts from Tom Lane's message of Fri May 11 16:50:01 -0400 2012:
>> I'm confused about what you did here and whether this isn't just pilot
>> error.

> The sequence of events is:
> postmaster -T
> crash a backend
> SIGINT postmaster
> SIGCONT all child processes

> My expectation is that postmaster should exit normally after this.

Well, my expectation is that the postmaster should wait for the children
to finish dying, and then exit rather than respawn anything. It is not
on the postmaster's head to make them die anymore, because it already
(thinks it) sent them SIGQUIT. Using SIGCONT here is pilot error.

> Maybe we can consider this to be just pilot error, but then why do all
> other processes exit normally?

The reason for that is that the postmaster's SIGINT interrupt handler
(lines 2163ff) sent them SIGTERM, without bothering to notice that we'd
already sent them SIGQUIT/SIGSTOP; so once you CONT them they get the
SIGTERM and drop out normally. That handler knows it should not signal
the checkpointer yet, so the checkpointer doesn't get the memo. But the
lack of a FatalError check here is just a matter of implementation
simplicity; it should not be necessary to send any more signals once we are
in FatalError state. Besides, this behavior is all wrong for a crash
recovery scenario: there is no guarantee that shared memory is in good
enough condition for SIGTERM shutdown to work. And we *definitely*
don't want the checkpointer trying to write a shutdown checkpoint.
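
The stop/continue behavior described above (a SIGTERM sent to a stopped
process stays pending and is delivered only once the process is continued)
can be reproduced outside PostgreSQL. The following is a minimal sketch
using plain POSIX calls; the function name signal_after_stop_then_cont is
hypothetical, and the child is a bare pause() loop standing in for a
backend, not real PostgreSQL code.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Kill and reap the child on an error path; always returns -1. */
static int abandon_child(pid_t pid)
{
    kill(pid, SIGKILL);
    waitpid(pid, NULL, 0);
    return -1;
}

/*
 * Fork a child, SIGSTOP it, send SIGTERM while it is stopped, then
 * SIGCONT it. Returns the number of the signal that terminated the
 * child, 0 if it exited without a signal, or -1 on error.
 */
int signal_after_stop_then_cont(void)
{
    int   status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0)
    {
        /* Child: default SIGTERM action, then sleep until signaled. */
        signal(SIGTERM, SIG_DFL);
        for (;;)
            pause();
    }

    /* Stop the child and wait until it is actually reported stopped. */
    if (kill(pid, SIGSTOP) != 0)
        return abandon_child(pid);
    if (waitpid(pid, &status, WUNTRACED) != pid || !WIFSTOPPED(status))
        return abandon_child(pid);

    /* Sent while stopped: the signal stays pending, it is not delivered. */
    if (kill(pid, SIGTERM) != 0)
        return abandon_child(pid);

    /* Continuing the child lets the pending SIGTERM be delivered. */
    if (kill(pid, SIGCONT) != 0)
        return abandon_child(pid);

    /* Without WUNTRACED/WCONTINUED this waits for termination only. */
    if (waitpid(pid, &status, 0) != pid)
        return -1;

    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

On a POSIX system the function is expected to return SIGTERM, which mirrors
the other children "dropping out normally" after SIGCONT. A process that was
never sent SIGTERM, as with the checkpointer here, would simply resume
running.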

>> So I don't see any bug here. And, after closer inspection, your
>> previous proposed patch is quite bogus because checkpointer is not
>> supposed to stop yet when the other processes are being terminated
>> normally.

> Well, it does send the signal only when FatalError is set. So it should
> only affect -T behavior.

If FatalError is set, it should not be necessary to send any more
signals, period, because we already tried to kill every child. If we
need to defend against somebody using SIGSTOP/SIGCONT inappropriately,
it would take a lot more thought (and code) than this, and it would
still be extremely fragile because a SIGCONT'd backend is going to be
executing against possibly-corrupt shared memory.
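
To make the rule concrete, here is a small sketch of the decision as argued
in this message. It is an illustrative model only, not the code in
postmaster.c: the names ChildKind and fast_shutdown_signal are hypothetical,
and the real handler works on process lists and state variables rather than
a pure function.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Hypothetical classification of postmaster children for this sketch. */
typedef enum
{
    CHILD_BACKEND,
    CHILD_CHECKPOINTER
} ChildKind;

/*
 * Which signal should a fast-shutdown request (SIGINT to the postmaster)
 * send to a child? Returns 0 when no signal should be sent.
 */
int fast_shutdown_signal(bool fatal_error, ChildKind kind)
{
    /*
     * After a crash every child has already been sent SIGQUIT (or SIGSTOP
     * under -T), so nothing more should be sent; shared memory may also be
     * too damaged for a SIGTERM-style shutdown to be safe.
     */
    if (fatal_error)
        return 0;

    /* In a normal shutdown the checkpointer is not signaled yet. */
    if (kind == CHILD_CHECKPOINTER)
        return 0;

    return SIGTERM;
}
```

Under this model the checkpointer is treated the same as every other child
once FatalError is set: none of them receives a further signal, so no
special case for the checkpointer is needed.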

regards, tom lane
