Re: Hot standby, recovery infra

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-09 15:13:02
Message-ID: 499047FE.9090407@enterprisedb.com
Lists: pgsql-hackers
Simon Riggs wrote:
> On Fri, 2009-02-06 at 10:06 +0200, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> On Thu, 2009-02-05 at 21:54 +0200, Heikki Linnakangas wrote:
>>>> - If you perform a fast shutdown while startup process is waiting for 
>>>> the restore command, startup process sometimes throws a FATAL error 
>>>> which escalates into an immediate shutdown. That leads to 
>>>> different messages in the logs, and skipping of the shutdown 
>>>> restartpoint that we now otherwise perform.
>>> Sometimes?
>> I think what happens is that if the restore command receives the SIGTERM 
>> and dies before the startup process that's waiting for the restore 
>> command receives the SIGTERM, the startup process throws a FATAL error 
>> because the restore command died unexpectedly. I put this
>>
>>> 	if (shutdown_requested && InRedo)
>>> 	{
>>> 		/* XXX: Is EndRecPtr always the right value here? */
>>> 		UpdateMinRecoveryPoint(EndRecPtr);
>>> 		proc_exit(0);
>>> 	}
>> right after the "system(xlogRestoreCmd)" call, to exit gracefully if we 
>> were requested to shut down while restore command was running, but it 
>> seems that that's not enough because of the race condition.
> 
> Can we trap the death of the restorecmd and handle it differently from
> the death of the startup process?

The startup process launches the restore command, so it's the startup 
process that needs to handle its death.

Anyway, I think I've found a solution. While we're executing the restore 
command, we're in a state where it's safe to call proc_exit(0). We can set a 
flag that tells the signal handler when we're executing the restore 
command, so that the handler can call proc_exit(0) on SIGTERM. So 
if the startup process receives the SIGTERM first, it will proc_exit(0) 
immediately, and if the restore command dies first due to the SIGTERM, 
the startup process exits with proc_exit(0) when it sees that the restore 
command exited because of the SIGTERM. If either process receives 
SIGTERM for some reason other than a fast shutdown request, postmaster 
will see that the startup process exited unexpectedly, and handle that 
like a child process crash.
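
To make the idea concrete, here is a rough standalone sketch of the flag-plus-handler scheme described above. It is not the code from the attached patch: all names in it (in_restore_command, sketch_sigterm_handler, classify_restore_status, run_restore_command, RestoreResult) are made up for illustration, _exit(0)/exit(0) stand in for proc_exit(0), and plain signal() stands in for whatever signal setup the backend really uses.

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical names for illustration; not taken from the patch. */
static volatile sig_atomic_t in_restore_command = 0;
static volatile sig_atomic_t shutdown_requested = 0;

typedef enum
{
	RESTORE_OK,			/* command exited with status 0 */
	RESTORE_FAILED,		/* non-zero exit, other signal, or system() failed */
	RESTORE_TERMINATED	/* command was killed by SIGTERM */
} RestoreResult;

/*
 * SIGTERM handler: while we are only waiting for the restore command it is
 * safe to exit right away; otherwise just remember the request.
 */
void
sketch_sigterm_handler(int signo)
{
	(void) signo;
	if (in_restore_command)
		_exit(0);			/* stand-in for proc_exit(0) */
	shutdown_requested = 1;
}

void
install_sigterm_handler(void)
{
	/* Assumption: signal() is enough for a sketch. */
	signal(SIGTERM, sketch_sigterm_handler);
}

/* Classify the wait status returned by system(). */
RestoreResult
classify_restore_status(int rc)
{
	if (rc == -1)
		return RESTORE_FAILED;
	if (WIFSIGNALED(rc) && WTERMSIG(rc) == SIGTERM)
		return RESTORE_TERMINATED;
	if (WIFEXITED(rc) && WEXITSTATUS(rc) == 0)
		return RESTORE_OK;
	return RESTORE_FAILED;
}

/* Run the restore command with the flag set around the system() call. */
RestoreResult
run_restore_command(const char *cmd)
{
	int			rc;

	assert(cmd != NULL);

	in_restore_command = 1;
	/* A SIGTERM may have arrived just before the flag was set. */
	if (shutdown_requested)
		exit(0);			/* stand-in for proc_exit(0) */
	rc = system(cmd);
	in_restore_command = 0;

	return classify_restore_status(rc);
}
```

In this sketch the caller would treat RESTORE_TERMINATED the same way as a SIGTERM delivered to itself and exit with status 0, so both orderings of the race end in the same clean exit. Anything else unexpected is reported as RESTORE_FAILED and left to the caller.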

Attached is an updated patch that does that, and I've fixed all the 
other outstanding issues I listed earlier as well. Now I'm feeling again 
that this is in pretty good shape.

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment: recovery-infra-2ffabdc.patch
Description: text/x-diff (68.9 KB)
