pg_standby, Restartable Recovery after Hard Failure

From: "Thomas F(dot) O'Connell" <tf(at)o(dot)ptimized(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Subject: pg_standby, Restartable Recovery after Hard Failure
Date: 2007-04-18 23:11:51
Message-ID: D0F1DEF3-F8AA-4DBC-BCF8-D8C7FC3D06ED@o.ptimized.com
Lists: pgsql-admin
As a test of restartable recovery and pg_standby in a warm standby
scenario, today I pulled the plug on the box where I was running
Simon's test_warm_standby test harness. In this setup, I had one
postgres cluster (primary) against which pgbench was being run and a
separate cluster (standby) that had been created from a base backup
and then put into continuous recovery using pg_standby. In the middle
of the run, I literally pulled the plug.

When the box came back up, I restarted primary. Everything came up
fine. Then I restarted standby. Here's what I got:

Trigger file            : /tmp/pgsql.trigger.5442
Waiting for WAL file    : ../archive/00000001000000000000000E
WAL file path           : 00000001000000000000000E
Restoring to...         : pg_xlog/RECOVERYXLOG
Sleep interval          : 5 seconds
Max wait interval       : 0 forever
Command for restore     : cp ../archive/00000001000000000000000E pg_xlog/RECOVERYXLOG
Num archived files kept : all files
WAL file not present yet. Checking for trigger file...
WAL file not present yet. Checking for trigger file...
WAL file not present yet. Checking for trigger file...
WAL file not present yet. Checking for trigger file...
...

So something seems to have misfired in pg_standby. I'm having a hard
time telling what hung it up. I recovered by touching the trigger
file, and standby came up as a running postgres server, but it was
behind, probably having replayed only as far as
00000001000000000000000E. The curious part is that all the files were
in the archive, so what state would pulling the plug have left behind
that pg_standby either interpreted incorrectly or failed to interpret?
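
For readers following the output above, here is a simplified sketch of the kind of polling loop those log lines suggest: look for the WAL file, look for the trigger file, sleep, repeat. It is an illustration only, not the actual pg_standby.c source; the names file_exists, wait_for_wal, and wait_result are hypothetical, and the max_polls parameter is an addition of this sketch so that the loop can terminate in a test.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch; not taken from pg_standby.c. */
typedef enum { WAL_READY, TRIGGERED, GAVE_UP } wait_result;

/* Returns 1 if something exists at path, 0 otherwise. A relative path is
 * resolved against the current working directory of the calling process. */
static int file_exists(const char *path)
{
    struct stat st;
    return stat(path, &st) == 0;
}

/* Poll for wal_path, checking trigger_path between attempts.
 * max_polls == 0 means wait forever, like "Max wait interval : 0 forever". */
static wait_result wait_for_wal(const char *wal_path, const char *trigger_path,
                                unsigned sleep_seconds, unsigned max_polls)
{
    unsigned polls = 0;

    for (;;) {
        if (file_exists(wal_path))
            return WAL_READY;

        fprintf(stderr, "WAL file not present yet. Checking for trigger file...\n");

        if (file_exists(trigger_path))
            return TRIGGERED;

        if (max_polls != 0 && ++polls >= max_polls)
            return GAVE_UP;

        sleep(sleep_seconds);
    }
}
```

With the values from the output above, the call would be wait_for_wal("../archive/00000001000000000000000E", "/tmp/pgsql.trigger.5442", 5, 0). In a loop of this shape, the "not present yet" message repeats for as long as stat() fails on the path exactly as given, so the relative ../archive path is only found if the process's working directory is what the harness expects. Whether that has any bearing on the hang described here is not something the output can confirm.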

When I tested a lighter-weight version of this scenario, merely
killing standby from the command line and then restarting it, I got
this:

Trigger file            : /tmp/pgsql.trigger.5442
Waiting for WAL file    : ../archive/00000001.history
WAL file path           : 00000001.history
Restoring to...         : pg_xlog/RECOVERYHISTORY
Sleep interval          : 5 seconds
Max wait interval       : 0 forever
Command for restore     : cp ../archive/00000001.history pg_xlog/RECOVERYHISTORY
Num archived files kept : all files
running restore         :cp: cannot access ../archive/00000001.history
cp: cannot access ../archive/00000001.history
cp: cannot access ../archive/00000001.history
not restored            : history file not found

But then it got back in the game and continued the continuous  
recovery process. I was able then to complete final recovery, and it  
seemed caught up.

If anyone can shed light on what might've happened in the hard  
failure scenario, I'd be interested to know. I've kept the various  
archive, primary, and standby directories created by  
test_warm_standby, so I can report on any file contents.

It occurs to me that timestamp information might be nice to have in
pg_standby's debug output. I might try patching pg_standby.c if no
one beats me to it.
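
A rough sketch of what timestamped debug output could look like in C follows. It is not a patch against the real pg_standby.c; format_timestamp and debug_log are hypothetical helper names, and the UTC format string is just one reasonable choice.

```c
#include <stdio.h>
#include <time.h>

/* Write t as "YYYY-MM-DD HH:MM:SS UTC" into buf.
 * Returns 1 on success, 0 if the conversion fails or buf is too small. */
static int format_timestamp(char *buf, size_t len, time_t t)
{
    struct tm *tm = gmtime(&t);

    if (tm == NULL)
        return 0;
    return strftime(buf, len, "%Y-%m-%d %H:%M:%S UTC", tm) > 0;
}

/* Print msg to out, prefixed with the current time. */
static void debug_log(FILE *out, const char *msg)
{
    char stamp[32];

    if (!format_timestamp(stamp, sizeof stamp, time(NULL)))
        stamp[0] = '\0';   /* fall back to an empty prefix */
    fprintf(out, "%s  %s\n", stamp, msg);
    fflush(out);           /* keep lines visible if the process is killed */
}
```

A call such as debug_log(stderr, "WAL file not present yet. Checking for trigger file...") would then show when each check happened, which would make it easier to line the standby's output up against the archive's file modification times after a hard failure.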

--
Thomas F. O'Connell

optimizing modern web applications
: for search engines, for usability, and for performance :

http://o.ptimized.com/
615-260-0005
