Warm standby recovery failure

From: Lee Azzarello <lee(at)dropio(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Subject: Warm standby recovery failure
Date: 2009-01-29 19:47:14
Message-ID: 16b031c0901291147m17c9e176occ73994d2706f454@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

I'm new to the list. So hello everyone. This morning my monitoring
system sent me an alert that my warm standby database had exited
recovery mode. I checked on it when I woke up and sure enough, it had
stopped looking for new WAL files and became available for queries.
After some digging, I uncovered these details, in a chronological
order:

*NAGIOS*
Date/Time: Thu Jan 29 12:36:04 UTC 2009
DISK CRITICAL - free space: /var/lib/postgresql 35 MB (0% inode=99%):

** MYSTERY PERIOD **
(two crontabs to do WAL maintance are scheduled for 04:00, not
applicable to this situation)

*NAGIOS*
Date/Time: Thu Jan 29 12:49:05 UTC 2009
DISK OK - free space: /var/lib/postgresql 14129 MB (29% inode=99%)

* me working in a shell on the host *
# date
Thu Jan 29 15:33:35 UTC 2009
# df -h /dev/sdc1
Filesystem Size Used Avail Use% Mounted on
/dev/sdc1 50G 3.3G 44G 7% /var/lib/postgresql

/var/log/postgresql/postgresql-8.3-main.log
2009-01-29 12:41:36 UTC LOG: restored log file
"000000010000002F000000A9" from archive
size= [16777216]
copy /mnt/pg_xlog/000000010000002F000000AA pg_xlog/RECOVERYXLOG
`/mnt/pg_xlog/000000010000002F000000AA' -> `pg_xlog/RECOVERYXLOG'
`/mnt/pg_xlog/000000010000002F000000AA' -> `pg_xlog/./000000010000002F000000AA'
cp: writing `pg_xlog/./000000010000002F000000AA': No space left on device
2009-01-29 12:48:14 UTC LOG: could not read from log file 47, segment
170, offset 3129344: Success
2009-01-29 12:48:14 UTC LOG: redo done at 2F/AA2FBE08
2009-01-29 12:48:14 UTC LOG: last completed transaction was at log
time 2009-01-29 12:42:50.572406+00
size= [16777216]
copy /mnt/pg_xlog/000000010000002F000000AA pg_xlog/RECOVERYXLOG
`/mnt/pg_xlog/000000010000002F000000AA' -> `pg_xlog/RECOVERYXLOG'
cp: overwrite `pg_xlog/./000000010000002F000000AA'? 2009-01-29
12:48:14 UTC LOG: restored log file "000000010000002F000000AA" from
archive
pg_xlog/RECOVERYHISTORY
2009-01-29 12:48:14 UTC LOG: selected new timeline ID: 2
pg_xlog/RECOVERYHISTORY
2009-01-29 12:48:14 UTC LOG: archive recovery complete

# /etc/init.d/postgresql-8.3 start
...
# ps ax | grep postgres
3499 ? Ss 0:00 postgres: startup process waiting for
000000020000002F000000AA
3503 ? S 0:00 /bin/bash
/var/lib/postgresql/bin/recover.sh pg_xlog/RECOVERYXLOG
000000020000002F000000AA

WTF? (before I knew about timelines)

# ls -l /mnt/pg_xlog/000000020000002F000000AA
ls: cannot access /mnt/pg_xlog/000000020000002F000000AA: No such file
or directory

# updatedb
# locate 000000020000002F000000AA
(nothing)
# rm /var/lib/postgresql/8.3/main/pg_xlog/0*
(perhaps a bad move? I still have these in /mnt/pg_xlog)
# /etc/init.d/postgresql-8.3 restart

Same problem. WTF?

# rm /var/lib/postgresql/8.3/main/pg_xlog/archive_status/00000002.history.ready
# /etc/init.d/postgresql-8.3 restart
# ps ax | grep postgres
3499 ? Ss 0:00 postgres: startup process waiting for
000000020000002F000000AA
3503 ? S 0:00 /bin/bash
/var/lib/postgresql/bin/recover.sh pg_xlog/RECOVERYXLOG
000000020000002F000000AA

Why is it looking for 000000020000002F000000AA? It should be looking
for 000000010000002F000000AA.
This seems to be a history problem with timelines. Recovery branched
into a new timeline, which is where it is now. I need to force it back
to the previous timeline so it can resume recovery. I'm building a
clean recovery database now but I'd like to get to the bottom of this
and bring the failed recovery database back into recovery mode so I
can learn how to get out of this situation more gracefully.

Thanks for any help you can offer.
_____________
Lee Azzarello
Staff Hacker
drop.io

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message Tom Lane 2009-01-30 01:06:41 Re: [GENERAL] Encoding problem using pg_dumpall
Previous Message Michael Monnerie 2009-01-29 19:14:28 Re: [GENERAL] Encoding problem using pg_dumpall