Re: checkpointer code behaving strangely on postmaster -T

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: checkpointer code behaving strangely on postmaster -T
Date: 2012-05-11 20:50:01
Message-ID: 15684.1336769401@sss.pgh.pa.us
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Excerpts from Tom Lane's message of jue may 10 02:27:32 -0400 2012:
>> Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> writes:
>>> I noticed while doing some tests that the checkpointer process does not
>>> recover very nicely after a backend crashes under postmaster -T (after
>>> all processes have been kill -CONTd, of course, and postmaster told to
>>> shut down via Ctrl-C on its console). For some reason it seems to get
>>> stuck in a loop doing sleep(0.5s). In another case I caught it trying to
>>> do a checkpoint, but it was progressing a single page each time and then
>>> sleeping. In that condition, the checkpoint took a very long time to
>>> finish.

>> Is this still a problem as of HEAD? I think I've fixed some issues in
>> the checkpointer's outer loop logic, but not sure if what you saw is
>> still there.

> Yep, it's still there as far as I can tell. A backtrace from the
> checkpointer shows it's waiting on the latch.

I'm confused about what you did here and whether this isn't just pilot
error. If you run with -T then the postmaster will just SIGSTOP the
remaining child processes, but then it will sit and wait for them to
die, since the state machine expects them to react as though they'd been
sent SIGQUIT. If you SIGCONT any of them then that process will resume,
totally ignorant that it's supposed to die. So "kill -CONTd, of course"
makes no sense to me. I tried killing one child with -KILL, then
sending SIGINT to the postmaster, then killing the remaining
already-stopped children, and the postmaster did exit as expected after
the last child died.
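
The signal semantics described above can be shown outside PostgreSQL with a minimal sketch. This is not postmaster code: the helper name stop_cont_kill_demo is hypothetical, and a sleeping child process stands in for a backend. It assumes a Unix system where waitpid() supports WUNTRACED and WCONTINUED. A stopped child that receives SIGCONT simply resumes, unaware that anyone expects it to die, whereas SIGKILL terminates it even while it is stopped.

```python
import os
import signal
import time


def stop_cont_kill_demo():
    """Fork a child, then SIGSTOP, SIGCONT, SIGSTOP and SIGKILL it.

    Returns the list of states the parent observed via waitpid().
    """
    pid = os.fork()
    if pid == 0:
        # Child: a stand-in for a backend. It just loops until killed
        # and has no idea that anyone is waiting for it to die.
        try:
            while True:
                time.sleep(0.05)
        finally:
            os._exit(0)

    events = []

    # Roughly what the postmaster does under -T: stop the child.
    os.kill(pid, signal.SIGSTOP)
    _, status = os.waitpid(pid, os.WUNTRACED)
    events.append("stopped" if os.WIFSTOPPED(status) else "unexpected")

    # SIGCONT only resumes the child; it does not ask it to exit.
    os.kill(pid, signal.SIGCONT)
    _, status = os.waitpid(pid, os.WCONTINUED)
    events.append("continued" if os.WIFCONTINUED(status) else "unexpected")

    # Give it a moment, then confirm it has not exited on its own.
    time.sleep(0.2)
    reaped, _ = os.waitpid(pid, os.WNOHANG)
    events.append("still running" if reaped == 0 else "exited")

    # Stop it again, then kill it while it is stopped.
    os.kill(pid, signal.SIGSTOP)
    os.waitpid(pid, os.WUNTRACED)
    os.kill(pid, signal.SIGKILL)
    _, status = os.waitpid(pid, 0)
    killed = os.WIFSIGNALED(status) and os.WTERMSIG(status) == signal.SIGKILL
    events.append("killed" if killed else "unexpected")
    return events
```

Calling print(stop_cont_kill_demo()) shows the sequence the parent observes. The "still running" step is the point at issue: nothing about SIGCONT tells the resumed process that a parent is waiting for it to exit.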

So I don't see any bug here. And, after closer inspection, your
previous proposed patch is quite bogus because checkpointer is not
supposed to stop yet when the other processes are being terminated
normally.

Possibly it'd be useful to teach the postmaster more thoroughly about
SIGSTOP and have a way for it to really kill the remaining children
after you've finished investigating their state. But frankly this
is the first time I've heard of anybody using that feature at all;
I always thought it was a vestigial hangover from days when the kernel
was too stupid to write separate core dump files for each backend.
I'd rather remove SendStop than add more complexity there.

regards, tom lane
