Spontaneous PITR standby activation

From: David Christensen <david(at)endpoint(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Spontaneous PITR standby activation
Date: 2009-11-25 20:53:34
Message-ID: CBC1F2EC-CDE6-45CD-8101-2695C427F55E@endpoint.com
Lists: pgsql-hackers

Last night the standby server in one of our PITR setups seemingly
spontaneously decided it had recovered completely and started up a new
timeline; I'm running the details by you guys to see whether what
happened is considered a bug, or whether there is a simple explanation.

So some questions here:

1) is there a hard limit on the number of times archive_command
will be retried? I didn't see anything documented about this in the PITR
or config docs, so I'm guessing the 10 failure warnings I saw in the log
were just a coincidence.

2) are the archive_command failures in the master's log responsible
for the redo records?

3) would a forced shutdown of Pg cause issues with the generated WAL
when it is replayed?

4) at first I thought it had to do with a bug/failure in pg_standby,
but now I'm wondering if it's related to the "record with zero length"
referenced in the standby's logs. Thoughts? (A rough cross-check I have
in mind is sketched just below.)
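
For question 4, the cross-check I have in mind is to compare the standby's
restored copy of the segment that redo stopped in against whatever the master
still has, roughly like this (the master-side pg_xlog path is a guess on my
part, and the segment may well have been recycled there already):

md5sum /db_data/pg_wal_archive/000000020000064200000003    # standby's archived copy
ssh aws-oi-db-master-11192009 \
    'md5sum /var/lib/postgresql/8.3/main/pg_xlog/000000020000064200000003'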

Details:

The postgres partition on the master had filled up completely, causing
the postmaster to shut down. We removed some cruft from the partition
and were able to get the server back up. When checking on the standby
server, I noticed that it had switched out of recovery mode, was
archiving its own WAL, etc., and was perfectly connectable from the
command line.

The archive_command is:

cp -i "%p" /mnt/pg_wal_archive/"%f" && cat %p | ssh backupserver
"cat - > /mnt/pg_wal_archive/.%f; if test -e /mnt/pg_wal_archive/%f ;
then rm /mnt/pg_wal_archive/.%f ; exit 1; else mv -f /mnt/
pg_wal_archive/.%f /mnt/pg_wal_archive/%f ; fi"
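
For what it's worth, a quick way to confirm whether the segment named in the
warnings below ever made it to both archive locations would be something like
this (paths taken from the archive_command above, segment name from the logs):

ls -l /mnt/pg_wal_archive/000000020000064200000003                     # on the master
ssh backupserver 'ls -l /mnt/pg_wal_archive/000000020000064200000003'  # on the backup server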

The recovery.conf file is a fairly vanilla pg_standby invocation:

restore_command = '/usr/lib/postgresql/8.3/bin/pg_standby -c -d -s 2
  -t /tmp/pgsql.trigger.6666 /db_data/pg_wal_archive %f %p %r
  2>> /db_data/pg_log/warmstandby.log'
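
Since pg_standby is only supposed to end recovery when the trigger file shows
up, I also want to rule out something having created /tmp/pgsql.trigger.6666.
I'm not certain off-hand whether pg_standby removes the file once it fires, so
an empty result may not prove much; the debug output it writes to
warmstandby.log seems like the better place to look:

ls -l /tmp/pgsql.trigger.6666                       # trigger file named in restore_command
grep -i trigger /db_data/pg_log/warmstandby.log     # pg_standby -d output
# (a deliberate failover on this setup would normally just be:
#  touch /tmp/pgsql.trigger.6666)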

Some possibly relevant settings from the master:

name | setting
------------------------------+-----------
archive_mode | on
archive_timeout | 60
checkpoint_completion_target | 0.7
checkpoint_segments | 100
checkpoint_timeout | 600
checkpoint_warning | 30
commit_delay | 0
commit_siblings | 5
fsync | on
full_page_writes | on
synchronous_commit | on
wal_buffers | 128
wal_sync_method | fdatasync
wal_writer_delay | 200
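
A query along these lines roughly reproduces that list straight from
pg_settings, in case the exact filter matters:

psql -c "SELECT name, setting FROM pg_settings
         WHERE name ~ '^(archive_|checkpoint_|commit_|fsync|full_page_writes|synchronous_commit|wal_)'
         ORDER BY name"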

----
Relevant lines from the standby:

2009-11-25 04:03:52 UTC [6315]: [7478-1] LOG: restored log file "000000020000064200000001" from archive
2009-11-25 04:03:58 UTC [6315]: [7479-1] LOG: restored log file "000000020000064200000002" from archive
2009-11-25 04:44:09 UTC [6315]: [7480-1] LOG: restored log file "000000020000064200000003" from archive
2009-11-25 04:45:22 UTC [6315]: [7481-1] LOG: record with zero length at 642/3FFEB38
2009-11-25 04:45:22 UTC [6315]: [7482-1] LOG: redo done at 642/3FFEAF0
2009-11-25 04:45:22 UTC [6315]: [7483-1] LOG: last completed transaction was at log time 2009-11-25 04:04:03.067187+00
2009-11-25 04:45:22 UTC [6315]: [7484-1] LOG: restored log file "000000020000064200000003" from archive
2009-11-25 04:45:28 UTC [6315]: [7485-1] LOG: selected new timeline ID: 3
2009-11-25 04:45:28 UTC [6315]: [7486-1] LOG: restored log file "00000002.history" from archive
2009-11-25 04:45:28 UTC [6315]: [7487-1] LOG: archive recovery complete
2009-11-25 04:45:28 UTC [6315]: [7488-1] LOG: checkpoint starting: shutdown immediate
2009-11-25 04:45:29 UTC [6315]: [7489-1] LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 14 recycled; write=0.009 s, sync=0.010 s, total=0.404 s
2009-11-25 04:45:30 UTC [22614]: [1-1] LOG: autovacuum launcher started
2009-11-25 04:45:30 UTC [6313]: [1-1] LOG: database system is ready to accept connections
2009-11-25 04:55:30 UTC [22612]: [1-1] LOG: checkpoint starting: time
----
From the master:
aws-oi-db-master-11192009:/var/log/postgresql# date
Wed Nov 25 14:40:44 EST 2009
aws-oi-db-master-11192009:/var/log/postgresql# grep WARNING postgresql-2009-11-25.log
2009-11-25 00:00:34 EST [30401]: [126-1] WARNING: transaction log file "000000020000064200000003" could not be archived: too many failures
2009-11-25 00:01:36 EST [30401]: [133-1] WARNING: transaction log file "000000020000064200000003" could not be archived: too many failures
2009-11-25 00:02:34 EST [30401]: [140-1] WARNING: transaction log file "000000020000064200000003" could not be archived: too many failures
2009-11-25 00:03:34 EST [30401]: [147-1] WARNING: transaction log file "000000020000064200000003" could not be archived: too many failures
2009-11-25 00:04:34 EST [30401]: [154-1] WARNING: transaction log file "000000020000064200000003" could not be archived: too many failures
2009-11-25 00:05:34 EST [30401]: [161-1] WARNING: transaction log file "000000020000064200000003" could not be archived: too many failures
2009-11-25 00:06:34 EST [30401]: [168-1] WARNING: transaction log file "000000020000064200000003" could not be archived: too many failures
2009-11-25 00:07:34 EST [30401]: [175-1] WARNING: transaction log file "000000020000064200000003" could not be archived: too many failures
2009-11-25 00:08:34 EST [30401]: [182-1] WARNING: transaction log file "000000020000064200000003" could not be archived: too many failures
2009-11-25 00:09:34 EST [30401]: [189-1] WARNING: transaction log file "000000020000064200000003" could not be archived: too many failures
----
Regards,

David
--
David Christensen
End Point Corporation
david(at)endpoint(dot)com
