Warm Standby startup process unconditionally hangs

From: Bernd Helmle <mailings(at)oopsware(dot)de>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Warm Standby startup process unconditionally hangs
Date: 2011-03-21 16:44:41
Message-ID: 011953B5954BA87C050E45B3@[172.26.14.62]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

PostgreSQL is 8.4.7, WAL-Logshipping is actually using SkyTools 2.1.12
(both installed from PGDG rpms), running with CentOS 5.5 XEN-VM with Kernel
2.6.18-194.26.1.el5xen.

recovery.conf on the standby installed by walmgr is

restore_command = '/usr/bin/walmgr.py
/media/pgarchive/wal-config/wal-slave.ini xrestore %f "%p" %r'
#recovery_target_time=
#recovery_target_xid=
#recovery_target_inclusive=true
#recovery_target_timeline=

The archive is located on a NFS share. We have seen the following happening
on a PostgreSQL Warm Standby:

Once for a while the Startup Process just stops recovering from archive. No
more xlogs replayed, no more activity in the logs from the startup process.
We've seen this two or three times in the past, this time we had the chance
to do some further investigation what's happening.

strace shows the startup process hanging with

strace -p 31898
Process 31898 attached - interrupt to quit
semop(425990, 0x7fff19919060, 1

Here is a backtrace from the startup process:

(gdb) bt
#0 0x0000003e214d5337 in semop () from /lib64/libc.so.6
#1 0x00000000005a7ff3 in PGSemaphoreLock (sema=0x2b35a0140ce8,
interruptOK=1 '\001') at pg_sema.c:420
#2 0x00000000005c68a5 in LockBufferForCleanup (buffer=78428) at
bufmgr.c:2467
#3 0x0000000000485446 in RestoreBkpBlocks (lsn=<value optimized out>,
record=0xcf2b830, cleanup=1 '\001') at xlog.c:3292
#4 0x0000000000460ce0 in heap2_redo (lsn=..., record=0x7fff19919060) at
heapam.c:4805
#5 0x0000000000482eff in StartupXLOG () at xlog.c:5665
#6 0x0000000000484cfb in StartupProcessMain () at xlog.c:8102
#7 0x000000000049b5a8 in AuxiliaryProcessMain (argc=2, argv=<value
optimized out>) at bootstrap.c:419
#8 0x00000000005b0ef9 in StartChildProcess (type=StartupProcess) at
postmaster.c:4319
#9 0x00000000005b30c7 in PostmasterMain (argc=5, argv=0xcf079e0) at
postmaster.c:1036
#10 0x000000000056079e in main (argc=5, argv=<value optimized out>) at
main.c:188

Looking at the code, it seems it is stuck in waiting for UnpinBuffer(),
ProcWaitForSignal(). Getting the backtrace for the bgwriter process, it
seems it is working as expected:

(gdb)
#0 0x0000003e214cd1c3 in __select_nocancel () from /lib64/libc.so.6
#1 0x00000000006b086a in pg_usleep (microsec=<value optimized out>) at
pgsleep.c:43
#2 0x00000000005ac00e in BackgroundWriterMain () at bgwriter.c:538
#3 0x000000000049b5b7 in AuxiliaryProcessMain (argc=2, argv=<value
optimized out>) at bootstrap.c:424
#4 0x00000000005b0ef9 in StartChildProcess (type=BgWriterProcess) at
postmaster.c:4319
#5 0x00000000005b4df6 in sigusr1_handler (postgres_signal_arg=<value
optimized out>) at postmaster.c:4095
#6 <signal handler called>
#7 0x0000003e214cd1c3 in __select_nocancel () from /lib64/libc.so.6
#8 0x00000000005b1b5e in ServerLoop () at postmaster.c:1347
#9 0x00000000005b30dc in PostmasterMain (argc=5, argv=0xcf079e0) at
postmaster.c:1040
#10 0x000000000056079e in main (argc=5, argv=<value optimized out>) at
main.c:188

The logfile shows the following sequence before this issue:

<@ 2011-03-21 13:56:46 CET 4d7a3ae1.7c9a-1457> LOG: restored log file
"000000010000002000000070" from archive
<@ 2011-03-21 13:56:46 CET 4d7a3ae2.7c9e-48794> DEBUG: updated min
recovery point to 20/704A6EC8
<@ 2011-03-21 13:56:46 CET 4d7a3ae2.7c9e-48795> CONTEXT: writing block 157
of relation base/70631854/70632521
<@ 2011-03-21 13:56:46 CET 4d7a3ae2.7c9e-48796> DEBUG: updated min
recovery point to 20/7054BD78
<@ 2011-03-21 13:56:46 CET 4d7a3ae2.7c9e-48797> CONTEXT: writing block 156
of relation base/70631854/70632521
<@ 2011-03-21 13:57:03 CET 4d7a3ae2.7c9e-48798> LOG: restartpoint
complete: wrote 4518 buffers (1.7%); write=225.415 s, sync=0.203 s,
total=225.621 s
<@ 2011-03-21 13:57:03 CET 4d7a3ae2.7c9e-48799> LOG: recovery restart
point at 20/68244EC8
<@ 2011-03-21 13:58:18 CET 4d7a3ae2.7c9e-48800> LOG: restartpoint
starting: time
<@ 2011-03-21 14:02:18 CET 4d7a3ae2.7c9e-48801> LOG: restartpoint
complete: wrote 3733 buffers (1.4%); write=239.997 s, sync=0.022 s,
total=240.022 s
<@ 2011-03-21 14:02:18 CET 4d7a3ae2.7c9e-48802> LOG: recovery restart
point at 20/6BEB2EB0
<@ 2011-03-21 14:03:18 CET 4d7a3ae2.7c9e-48803> DEBUG: skipping
restartpoint, already performed at 20/6BEB2EB0
<@ 2011-03-21 14:03:33 CET 4d7a3ae2.7c9e-48804> DEBUG: skipping
restartpoint, already performed at 20/6BEB2EB0
<@ 2011-03-21 14:03:48 CET 4d7a3ae2.7c9e-48805> DEBUG: skipping
restartpoint, already performed at 20/6BEB2EB0

After this, the startup process just "hangs", with the symptoms shown
above. I need to restart the standby now, because the customer wants to
have it back as soon as possible, but if you need more infos, please let me
know.

--
Thanks

Bernd

Browse pgsql-bugs by date

  From Date Subject
Next Message Craig Sacco 2011-03-22 05:10:53 BUG #5938: PostgreSQL Installer outputs log file with superuser password in clear text
Previous Message Robert Haas 2011-03-21 15:49:42 Re: TO_CHAR(timestamptz,datetimeformat) wrong after DST change