Re: Hot standby, recovery infra

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-26 18:38:33
Message-ID: 49A6E1A9.5020901@enterprisedb.com
Lists: pgsql-hackers
Fujii Masao wrote:
> On Fri, Jan 30, 2009 at 7:47 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> That whole area was something I was leaving until last, since immediate
>> shutdown doesn't work either, even in HEAD. (Fujii-san and I discussed
>> this before Christmas, briefly).
> 
> This problem remains in current HEAD. I mean, immediate shutdown
> may be unable to kill the startup process because system() which
> executes restore_command ignores SIGQUIT while waiting.
> When I tried immediate shutdown during recovery, only the startup
> process survived. This is undesirable behavior, I think.

Yeah, we need to fix that.

> The following code should be added into RestoreArchivedFile()?
> 
> ----
> if (WTERMSIG(rc) == SIGQUIT)
>        exit(2);
> ----

I don't see how that helps, as we already have this in there:

	signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;

	ereport(signaled ? FATAL : DEBUG2,
		(errmsg("could not restore file \"%s\" from archive: return code %d",
				xlogfname, rc)));

which means we already ereport(FATAL) if the restore command dies with 
SIGQUIT.
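
To make that check concrete, here is a minimal standalone sketch of the same 
classification outside the backend. Only the WIFSIGNALED/WEXITSTATUS 
expression is taken from the code above; restore_command_signaled() and 
run_and_classify() are hypothetical names used for illustration, not 
functions in the PostgreSQL source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/wait.h>

/*
 * Same test as quoted above: the command counts as "signaled" if it was
 * killed by a signal, or if it exited with a status above 125 (a shell
 * conventionally reports 126/127 for exec failures and 128+N when a
 * command it ran died from signal N).
 */
static bool
restore_command_signaled(int rc)
{
	return WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;
}

/* Hypothetical helper: run a command via system() and classify the result. */
static bool
run_and_classify(const char *cmd)
{
	int			rc;

	/* system(NULL) only asks whether a shell exists, so insist on a command */
	assert(cmd != NULL);

	rc = system(cmd);
	return restore_command_signaled(rc);
}
```

Under this sketch, a command that does "exit 1" is an ordinary failure, while 
one that does "exit 128" or is killed by a signal is classified as signaled.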

I think the real problem here is that pg_standby traps SIGQUIT. The 
startup process doesn't receive the SIGQUIT because it's in system(), 
and pg_standby doesn't propagate it to the startup process either 
because it traps it.

I think we should simply remove the signal handler for SIGQUIT from 
pg_standby. Or will that lead to a core dump by default? In that case, we 
need pg_standby to exit(128) or similar, so that RestoreArchivedFile 
understands that the command was killed by a signal.

Another approach is to check that the postmaster is still alive, like we 
do in walwriter and bgwriter:

		/*
		 * Emergency bailout if postmaster has died.  This is to avoid the
		 * necessity for manual cleanup of all postmaster children.
		 */
		if (!PostmasterIsAlive(true))
			exit(1);

However, I'm afraid there's a race condition with that. If we do that 
right after system(), postmaster might've signaled us but not exited 
yet. We could check that in the main loop, but if we wrongly interpret 
the exit of the recovery command as a "file not found - go ahead and 
start up", the damage might be done by the time we notice that the 
postmaster is gone.

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
