WAL archive stopping too early

From: Andy Shellam <andy(dot)shellam(at)mailnetwork(dot)co(dot)uk>
To: pgsql-admin(at)postgresql(dot)org
Subject: WAL archive stopping too early
Date: 2006-02-17 09:42:32
Message-ID: 43F59A88.1030504@mailnetwork.co.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

I'm trying to set up WAL archiving on my server, and replicate my live
database to a hot-spare server using online backups.
However, the postmaster on the hot-spare is stopping too early during
the recovery.

In my archived xlog folder on the live server, I have the following
xlogs (note the backup history file):

-rw------- 1 postgresql datahandler 16777216 2006-02-16 15:14
000000010000000900000019
-rw------- 1 postgresql datahandler 273 2006-02-16 15:14
000000010000000900000019.0035C2F4.backup
-rw------- 1 postgresql datahandler 16777216 2006-02-16 15:14
00000001000000090000001A
-rw------- 1 postgresql datahandler 16777216 2006-02-16 15:50
00000001000000090000001B
-rw------- 1 postgresql datahandler 16777216 2006-02-16 15:50
00000001000000090000001C
-rw------- 1 postgresql datahandler 16777216 2006-02-16 15:50
00000001000000090000001D
-rw------- 1 postgresql datahandler 16777216 2006-02-16 15:50
00000001000000090000001E
-rw------- 1 postgresql datahandler 16777216 2006-02-16 15:25
00000001000000090000001F
-rw------- 1 postgresql datahandler 16777216 2006-02-16 15:35
000000010000000900000020
-rw------- 1 postgresql datahandler 16777216 2006-02-16 15:38
000000010000000900000021
-rw------- 1 postgresql datahandler 16777216 2006-02-16 15:44
000000010000000900000022
-rw------- 1 postgresql datahandler 16777216 2006-02-16 15:46
000000010000000900000023

I have replicated this folder to the xlog_archive folder on the
hot-spare. I have taken a tar of the entire $PGDATA on the live server
(during a start_backup and stop_backup), and un-tarred it on the
hot-spare (changing the owner/permissions as appropriate.) Note the
pg_xlog was not included in the tar, so I've created the correct
structure on the spare (as a symlink to a separate physical disk.)

In my recovery.conf, I've added the single recovery_command and changed
the owner to my superuser (postgresql.)

Then when I start the postmaster on the spare, here's what follows in
the log:

2006-02-17 08:59:40 GMT LOG: starting archive recovery
2006-02-17 08:59:40 GMT LOG: restore_command = "cp
/mndata/logs/xlog_archive/%f "%p""
cp: cannot stat `/mndata/logs/xlog_archive/00000001.history': No such
file or directory
2006-02-17 08:59:40 GMT LOG: restored log file
"000000010000000900000019.0035C2F4.backup" from archive
2006-02-17 08:59:40 GMT LOG: restored log file
"000000010000000900000019" from archive
2006-02-17 08:59:40 GMT LOG: checkpoint record is at 9/1935C2F4
2006-02-17 08:59:40 GMT LOG: redo record is at 9/1935C2F4; undo record
is at 0/0; shutdown FALSE
2006-02-17 08:59:40 GMT LOG: next transaction ID: 795500; next OID: 37184
2006-02-17 08:59:40 GMT LOG: next MultiXactId: 1; next MultiXactOffset: 0
2006-02-17 08:59:40 GMT LOG: automatic recovery in progress
2006-02-17 08:59:40 GMT LOG: redo starts at 9/1935C338
2006-02-17 08:59:42 GMT LOG: restored log file
"00000001000000090000001A" from archive
2006-02-17 08:59:44 GMT LOG: restored log file
"00000001000000090000001B" from archive
2006-02-17 09:00:09 GMT LOG: record with zero length at 9/1BEB2430
2006-02-17 09:00:09 GMT LOG: redo done at 9/1BEB23EC
2006-02-17 09:00:09 GMT LOG: restored log file
"00000001000000090000001B" from archive
2006-02-17 09:00:09 GMT LOG: archive recovery complete
2006-02-17 09:00:20 GMT LOG: database system is ready
2006-02-17 09:00:20 GMT LOG: transaction ID wrap limit is 1074536488,
limited by database "dspam"

Note there are plenty more files ready to run in the archive, and the
contents of the backup label only mention 1A as the final WAL file of
the backup, so it is starting to read the next WAL file, but stopping
for some reason. Is it because of the "record with zero length" that's
causing it to fail? Also why is it restoring 1B from the archive
twice? There is a committed transaction in the live database that is
not appearing in the hot-spare after the restore, which is how I know
it's not completing correctly.

Here's the backup label file contents:

START WAL LOCATION: 9/1935C2F4 (file 000000010000000900000019)
STOP WAL LOCATION: 9/1AC3DCF8 (file 00000001000000090000001A)
CHECKPOINT LOCATION: 9/1935C2F4
START TIME: 2006-02-16 11:39:57 GMT
LABEL: /perth/backup/db/base_20060216.tar.gz
STOP TIME: 2006-02-16 12:13:49 GMT

Can anyone shed any light?

Many thanks

Andy Shellam
the mail network - server support

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message Gregory S. Williamson 2006-02-17 12:33:34 Re: Something like pg_dump for 7.4
Previous Message Gregory S. Williamson 2006-02-17 08:58:41 Re: Something like pg_dump for 7.4