Re: Hot standby, recovery infra

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-26 18:38:33
Message-ID: 49A6E1A9.5020901@enterprisedb.com
Lists: pgsql-hackers

Fujii Masao wrote:
> On Fri, Jan 30, 2009 at 7:47 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> That whole area was something I was leaving until last, since immediate
>> shutdown doesn't work either, even in HEAD. (Fujii-san and I discussed
>> this before Christmas, briefly).
>
> This problem remains in current HEAD. I mean, immediate shutdown
> may be unable to kill the startup process because system() which
> executes restore_command ignores SIGQUIT while waiting.
> When I tried immediate shutdown during recovery, only the startup
> process survived. This is undesirable behavior, I think.

Yeah, we need to fix that.

> The following code should be added into RestoreArchivedFile()?
>
> ----
>     if (WTERMSIG(rc) == SIGQUIT)
>         exit(2);
> ----

I don't see how that helps, as we already have this in there:

    signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;

    ereport(signaled ? FATAL : DEBUG2,
            (errmsg("could not restore file \"%s\" from archive: return code %d",
                    xlogfname, rc)));

which means we already ereport(FATAL) if the restore command dies with
SIGQUIT.
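
To make that test concrete, here is a minimal standalone sketch of the same classification applied to a status returned by system(). The helper name restore_failure_is_fatal is hypothetical and is not part of the PostgreSQL source; only the expression inside it is taken from the snippet above.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <sys/wait.h>

/*
 * Hypothetical helper (not a PostgreSQL function): applies the same test
 * as the snippet above to a wait status returned by system().
 */
bool
restore_failure_is_fatal(int rc)
{
    /*
     * True if the command was terminated by a signal, or if it exited
     * with a code above 125 (the range shells use for "could not run the
     * command" and "killed by signal N").
     */
    return WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;
}
```

With this helper, system("exit 1") is classified as an ordinary failure, while system("exit 127") and a shell that kills itself with a signal (for example system("kill -KILL $$")) are classified as fatal.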

I think the real problem here is that pg_standby traps SIGQUIT. The
startup process doesn't receive the SIGQUIT because it's in system(),
and pg_standby doesn't propagate it to the startup process either
because it traps it.
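
The first half of that can be checked in isolation. POSIX specifies that system() ignores SIGINT and SIGQUIT in the calling process while it waits for the command. The sketch below is an illustration only, with hypothetical names (on_quit, quit_delivered_during_system): it installs a SIGQUIT handler, has the child shell send SIGQUIT back to its parent, and reports whether the handler ran.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdlib.h>

static volatile sig_atomic_t quit_seen = 0;

/* Records that SIGQUIT reached this process. */
static void
on_quit(int signo)
{
    (void) signo;
    quit_seen = 1;
}

/*
 * Hypothetical demo function: returns 1 if our SIGQUIT handler ran while
 * system() was waiting, 0 if it did not, and -1 on setup failure.
 */
int
quit_delivered_during_system(void)
{
    struct sigaction sa;

    sa.sa_handler = on_quit;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGQUIT, &sa, NULL) != 0)
        return -1;

    quit_seen = 0;

    /* The child shell sends SIGQUIT to its parent, i.e. to this process. */
    if (system("kill -QUIT $PPID") == -1)
        return -1;

    return quit_seen;
}
```

On a POSIX-conforming system the function returns 0: the signal arrives while SIGQUIT is set to be ignored inside system(), so it is discarded and the handler never runs.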

I think we should simply remove the signal handler for SIGQUIT from
pg_standby. Or will that lead to a core dump by default? In that case, we
need pg_standby to exit(128) or similar, so that RestoreArchivedFile
understands that the command was killed by a signal.
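
A rough sketch of that second option, assuming a handler that does nothing but exit with status 128. This is not the actual pg_standby code; quit_and_report and run_quit_child are hypothetical names, and the forked child merely stands in for the restore command.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical handler: exit with a status above 125, without dumping core. */
static void
quit_and_report(int signo)
{
    (void) signo;
    _exit(128);                 /* _exit() is async-signal-safe */
}

/*
 * Hypothetical demo: forks a child standing in for the restore command,
 * delivers SIGQUIT to it, and returns its wait status (-1 on error).
 */
int
run_quit_child(void)
{
    pid_t       pid = fork();
    int         status;

    if (pid < 0)
        return -1;

    if (pid == 0)
    {
        struct sigaction sa;
        sigset_t    unblock;

        sa.sa_handler = quit_and_report;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        sigaction(SIGQUIT, &sa, NULL);

        /* Make sure SIGQUIT is not blocked by an inherited signal mask. */
        sigemptyset(&unblock);
        sigaddset(&unblock, SIGQUIT);
        sigprocmask(SIG_UNBLOCK, &unblock, NULL);

        raise(SIGQUIT);
        _exit(0);               /* not reached if the handler ran */
    }

    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return status;
}
```

The returned status satisfies WIFEXITED with WEXITSTATUS equal to 128, so the existing "WEXITSTATUS(rc) > 125" test would treat it the same way as a command killed by a signal.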

Another approach is to check that the postmaster is still alive, like we
do in walwriter and bgwriter:

    /*
     * Emergency bailout if postmaster has died. This is to avoid the
     * necessity for manual cleanup of all postmaster children.
     */
    if (!PostmasterIsAlive(true))
        exit(1);

However, I'm afraid there's a race condition with that. If we do that
right after system(), postmaster might've signaled us but not exited
yet. We could check that in the main loop, but if we wrongly interpret
the exit of the recovery command as a "file not found - go ahead and
start up", the damage might be done by the time we notice that the
postmaster is gone.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
