
Warm Standby startup process unconditionally hangs

From: Bernd Helmle <mailings(at)oopsware(dot)de>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Warm Standby startup process unconditionally hangs
Date: 2011-03-21 16:44:41
Message-ID: 011953B5954BA87C050E45B3@[172.26.14.62]
Lists: pgsql-bugs
PostgreSQL is 8.4.7; WAL log shipping is done with SkyTools 2.1.12 
(both installed from PGDG RPMs), running on a CentOS 5.5 Xen VM with kernel 
2.6.18-194.26.1.el5xen.

The recovery.conf installed on the standby by walmgr is:

restore_command = '/usr/bin/walmgr.py /media/pgarchive/wal-config/wal-slave.ini xrestore %f "%p" %r'
#recovery_target_time=
#recovery_target_xid=
#recovery_target_inclusive=true
#recovery_target_timeline=

The archive is located on an NFS share. We have seen the following happen 
on a PostgreSQL warm standby:

Every once in a while, the startup process just stops recovering from the 
archive: no more xlogs are replayed, and the startup process logs no 
further activity. We've seen this two or three times in the past; this 
time we had the chance to investigate further what is happening.

strace shows the startup process hanging with

strace -p 31898
Process 31898 attached - interrupt to quit
semop(425990, 0x7fff19919060, 1

Here is a backtrace from the startup process:

(gdb) bt
#0  0x0000003e214d5337 in semop () from /lib64/libc.so.6
#1  0x00000000005a7ff3 in PGSemaphoreLock (sema=0x2b35a0140ce8, 
interruptOK=1 '\001') at pg_sema.c:420
#2  0x00000000005c68a5 in LockBufferForCleanup (buffer=78428) at 
bufmgr.c:2467
#3  0x0000000000485446 in RestoreBkpBlocks (lsn=<value optimized out>, 
record=0xcf2b830, cleanup=1 '\001') at xlog.c:3292
#4  0x0000000000460ce0 in heap2_redo (lsn=..., record=0x7fff19919060) at 
heapam.c:4805
#5  0x0000000000482eff in StartupXLOG () at xlog.c:5665
#6  0x0000000000484cfb in StartupProcessMain () at xlog.c:8102
#7  0x000000000049b5a8 in AuxiliaryProcessMain (argc=2, argv=<value 
optimized out>) at bootstrap.c:419
#8  0x00000000005b0ef9 in StartChildProcess (type=StartupProcess) at 
postmaster.c:4319
#9  0x00000000005b30c7 in PostmasterMain (argc=5, argv=0xcf079e0) at 
postmaster.c:1036
#10 0x000000000056079e in main (argc=5, argv=<value optimized out>) at 
main.c:188

Looking at the code, the startup process seems to be stuck in 
LockBufferForCleanup(), waiting in ProcWaitForSignal() after 
UnpinBuffer(). A backtrace of the bgwriter process suggests it is working 
as expected:

(gdb)
#0  0x0000003e214cd1c3 in __select_nocancel () from /lib64/libc.so.6
#1  0x00000000006b086a in pg_usleep (microsec=<value optimized out>) at 
pgsleep.c:43
#2  0x00000000005ac00e in BackgroundWriterMain () at bgwriter.c:538
#3  0x000000000049b5b7 in AuxiliaryProcessMain (argc=2, argv=<value 
optimized out>) at bootstrap.c:424
#4  0x00000000005b0ef9 in StartChildProcess (type=BgWriterProcess) at 
postmaster.c:4319
#5  0x00000000005b4df6 in sigusr1_handler (postgres_signal_arg=<value 
optimized out>) at postmaster.c:4095
#6  <signal handler called>
#7  0x0000003e214cd1c3 in __select_nocancel () from /lib64/libc.so.6
#8  0x00000000005b1b5e in ServerLoop () at postmaster.c:1347
#9  0x00000000005b30dc in PostmasterMain (argc=5, argv=0xcf079e0) at 
postmaster.c:1040
#10 0x000000000056079e in main (argc=5, argv=<value optimized out>) at 
main.c:188
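
To illustrate the situation, here is a toy model in Python (not PostgreSQL's actual C implementation, which uses a BM_PIN_COUNT_WAITER flag and a per-process semaphore rather than a condition variable). The hypothetical BufferPin class sketches the cleanup-lock protocol: the waiter takes a pin and then blocks until its own pin is the only one left. If another pin on the buffer is never released, the wait never ends, which matches the observed semop() hang:

```python
import threading

class BufferPin:
    """Toy model of PostgreSQL's cleanup-lock wait (hypothetical sketch)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cond = threading.Condition(self._lock)
        self._refcount = 0  # number of pins currently held on the buffer

    def pin(self):
        with self._lock:
            self._refcount += 1

    def unpin(self):
        with self._lock:
            self._refcount -= 1
            self._cond.notify_all()  # wake a cleanup-lock waiter, if any

    def lock_for_cleanup(self, timeout=None):
        """Take a pin, then wait until ours is the only pin on the buffer.

        Returns True once the cleanup lock is acquired; False on timeout
        (in which case our pin is released again). The real code has no
        timeout -- it waits indefinitely, which is what the startup
        process is doing here.
        """
        with self._lock:
            self._refcount += 1
            ok = self._cond.wait_for(lambda: self._refcount == 1, timeout)
            if not ok:
                self._refcount -= 1
            return ok
```

With no other pin, lock_for_cleanup() returns immediately; with a pin that is never released (or a lost wakeup), it blocks forever, exactly the symptom in the backtrace above.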

The logfile shows the following sequence before this issue:

<@ 2011-03-21 13:56:46 CET 4d7a3ae1.7c9a-1457> LOG:  restored log file 
"000000010000002000000070" from archive
<@ 2011-03-21 13:56:46 CET 4d7a3ae2.7c9e-48794> DEBUG:  updated min 
recovery point to 20/704A6EC8
<@ 2011-03-21 13:56:46 CET 4d7a3ae2.7c9e-48795> CONTEXT:  writing block 157 
of relation base/70631854/70632521
<@ 2011-03-21 13:56:46 CET 4d7a3ae2.7c9e-48796> DEBUG:  updated min 
recovery point to 20/7054BD78
<@ 2011-03-21 13:56:46 CET 4d7a3ae2.7c9e-48797> CONTEXT:  writing block 156 
of relation base/70631854/70632521
<@ 2011-03-21 13:57:03 CET 4d7a3ae2.7c9e-48798> LOG:  restartpoint 
complete: wrote 4518 buffers (1.7%); write=225.415 s, sync=0.203 s, 
total=225.621 s
<@ 2011-03-21 13:57:03 CET 4d7a3ae2.7c9e-48799> LOG:  recovery restart 
point at 20/68244EC8
<@ 2011-03-21 13:58:18 CET 4d7a3ae2.7c9e-48800> LOG:  restartpoint 
starting: time
<@ 2011-03-21 14:02:18 CET 4d7a3ae2.7c9e-48801> LOG:  restartpoint 
complete: wrote 3733 buffers (1.4%); write=239.997 s, sync=0.022 s, 
total=240.022 s
<@ 2011-03-21 14:02:18 CET 4d7a3ae2.7c9e-48802> LOG:  recovery restart 
point at 20/6BEB2EB0
<@ 2011-03-21 14:03:18 CET 4d7a3ae2.7c9e-48803> DEBUG:  skipping 
restartpoint, already performed at 20/6BEB2EB0
<@ 2011-03-21 14:03:33 CET 4d7a3ae2.7c9e-48804> DEBUG:  skipping 
restartpoint, already performed at 20/6BEB2EB0
<@ 2011-03-21 14:03:48 CET 4d7a3ae2.7c9e-48805> DEBUG:  skipping 
restartpoint, already performed at 20/6BEB2EB0
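
For scale, the WAL locations in the log can be converted to byte positions. The helper below (lsn_to_bytes is a hypothetical name, not a PostgreSQL function) uses the pre-9.3 convention where each "xlogid" covers 0xFF000000 bytes, since this is 8.4:

```python
def lsn_to_bytes(lsn, pre93=True):
    """Convert an 'X/Y' WAL location to an absolute byte position.

    Before PostgreSQL 9.3, each xlogid covered 0xFF000000 bytes (the
    last 16 MB segment per logid was skipped); from 9.3 on a location
    is a plain 64-bit position.
    """
    xlogid, xrecoff = (int(part, 16) for part in lsn.split("/"))
    per_logid = 0xFF000000 if pre93 else 1 << 32
    return xlogid * per_logid + xrecoff

# Distance between the last min-recovery-point update and the most
# recent restart point in the log above (both within logid 0x20):
gap = lsn_to_bytes("20/7054BD78") - lsn_to_bytes("20/68244EC8")
print(gap / (1024 * 1024))  # ~131 MB of WAL between the two points
```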

After this, the startup process just "hangs", with the symptoms shown 
above. I need to restart the standby now, because the customer wants to 
have it back as soon as possible, but if you need more info, please let me 
know.

-- 
Thanks

	Bernd
