pitr replica dies on startup

From: Jeff Frost <jeff(at)frostconsultingllc(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Subject: pitr replica dies on startup
Date: 2007-08-31 22:44:57
Message-ID: Pine.LNX.4.64.0708311523260.11463@discord.home.frostconsultingllc.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

Hi guys, I've inherited a PITR continuous recovery master/standby server pair.
The continuous recovery and loading of the xlogs seems to work fine, but when
I opted to test the replica bring up, it falls over with signal 6.

Here's an excerpt from the log with log levels set up to debug5:

DEBUG: max_safe_fds = 984, usable_fds = 1000, already_open = 6
LOG: database system was interrupted at 2007-08-31 15:09:40 PDT
LOG: starting archive recovery
LOG: restore_command = "/data/pg/data/recovery.py /data/pg/wal/%f "%p""
DEBUG: executing restore command "/data/pg/data/recovery.py
/data/pg/wal/00000001.history "pg_xlog/RECOVERYHISTORY""
DEBUG: could not restore file "00000001.history" from archive: return code 0
DEBUG: executing restore command "/data/pg/data/recovery.py
/data/pg/wal/00000001000000000000004C.00000020.backup
"pg_xlog/RECOVERYHISTORY""
LOG: restored log file "00000001000000000000004C.00000020.backup" from
archive
DEBUG: executing restore command "/data/pg/data/recovery.py
/data/pg/wal/00000001000000000000004C "pg_xlog/RECOVERYXLOG""
LOG: restored log file "00000001000000000000004C" from archive
LOG: checkpoint record is at 0/4C000020
LOG: redo record is at 0/4C000020; undo record is at 0/0; shutdown FALSE
LOG: next transaction ID: 0/1222; next OID: 24576
LOG: next MultiXactId: 1; next MultiXactOffset: 0
LOG: automatic recovery in progress
LOG: redo starts at 0/4C000070
DEBUG: executing restore command "/data/pg/data/recovery.py
/data/pg/wal/00000001000000000000004D "pg_xlog/RECOVERYXLOG""
LOG: restored log file "00000001000000000000004D" from archive
DEBUG: recovery restart point at 0/4D000020
CONTEXT: xlog redo checkpoint: redo 0/4D000020; undo 0/0; tli 1; xid 0/1232;
oid 24576; multi 1; offset 0; online
DEBUG: executing restore command "/data/pg/data/recovery.py
/data/pg/wal/00000001000000000000004E "pg_xlog/RECOVERYXLOG""
LOG: restored log file "00000001000000000000004E" from archive
DEBUG: recovery restart point at 0/4E000020
CONTEXT: xlog redo checkpoint: redo 0/4E000020; undo 0/0; tli 1; xid 0/1242;
oid 24576; multi 1; offset 0; online
DEBUG: executing restore command "/data/pg/data/recovery.py
/data/pg/wal/00000001000000000000004F "pg_xlog/RECOVERYXLOG""
LOG: restored log file "00000001000000000000004F" from archive
DEBUG: recovery restart point at 0/4F000020
CONTEXT: xlog redo checkpoint: redo 0/4F000020; undo 0/0; tli 1; xid 0/1252;
oid 24576; multi 1; offset 0; online
DEBUG: executing restore command "/data/pg/data/recovery.py
/data/pg/wal/000000010000000000000050 "pg_xlog/RECOVERYXLOG""
DEBUG: could not restore file "000000010000000000000050" from archive: return
code 256
LOG: could not open file "pg_xlog/000000010000000000000050" (log file 0,
segment 80): No such file or directory
LOG: redo done at 0/4F000070
DEBUG: executing restore command "/data/pg/data/recovery.py
/data/pg/wal/00000001000000000000004F "pg_xlog/RECOVERYXLOG""
DEBUG: could not restore file "00000001000000000000004F" from archive: return
code 256
PANIC: could not open file "pg_xlog/00000001000000000000004F" (log file 0,
segment 79): No such file or directory
DEBUG: reaping dead processes
LOG: startup process (PID 31211) was terminated by signal 6
LOG: aborting startup due to startup process failure
DEBUG: proc_exit(1)
DEBUG: shmem_exit(1)
DEBUG: exit(1)

You can see the rest of the log file here:

http://pgsql.privatepaste.com/841mOR5gzm

It seems like everything is happy, except it seems to ask for the 4F log file
more than once.

Directory permissions look good:

total 45
-rw------- 1 postgres postgres 4 Aug 31 11:13 PG_VERSION
-rw------- 1 postgres postgres 167 Aug 31 15:09 backup_label.old
drwx------ 5 postgres postgres 120 Aug 31 14:59 base
drwx------ 2 postgres postgres 768 Aug 31 15:08 global
drwx------ 2 postgres postgres 72 Aug 31 11:13 pg_clog
-rw------- 1 postgres postgres 3279 Aug 31 11:16 pg_hba.conf
-rw------- 1 postgres postgres 1460 Aug 31 11:14 pg_ident.conf
drwx------ 4 postgres postgres 96 Aug 31 11:13 pg_multixact
drwx------ 2 postgres postgres 72 Aug 31 11:13 pg_subtrans
drwx------ 2 postgres postgres 48 Aug 31 11:13 pg_tblspc
drwx------ 2 postgres postgres 48 Aug 31 11:13 pg_twophase
drwx------ 3 postgres postgres 112 Aug 31 15:24 pg_xlog
-rw-r--r-- 1 postgres postgres 15370 Aug 31 14:57 postgresql.conf
-rw------- 1 postgres postgres 35 Aug 31 15:09 postmaster.opts
-rw-r--r-- 1 postgres postgres 67 Aug 31 13:40 recovery.conf
-rwxr-xr-x 1 postgres postgres 1384 Aug 31 13:40 recovery.py

I verified that the files are really being copied into
pg_xlog/RECOVERYXLOG.

This is 8.2.4 on Gentoo, but I observed the same behavior with 8.2.3 on
Gentoo.

We've md5sum'd the postgres binary, diffed the output of pg_config and
everything is the same.

I tried starting up postgres in a newly initdb'd directory and it starts up
fine. I also tried using the conf files from the 'broken' directory in this
new directory and it starts up fine, but if I try to replicate into this
directory, postgres exits with signal 6 just the same.

Conversely, using a default postgresql.conf file in the 'broken' directory
leaves it just as broken.

The master can start/stop just fine with no problems.

What else to look at? I do have a core file, back trace can be found here:

http://pgsql.privatepaste.com/8f0WDAAzz9

Any suggestions much appreciated!

--
Jeff Frost, Owner <jeff(at)frostconsultingllc(dot)com>
Frost Consulting, LLC http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message Tom Lane 2007-08-31 23:19:33 Re: pitr replica dies on startup
Previous Message Kevin Grittner 2007-08-31 21:12:02 Re: How to monitor resources on Linux.