Re: Warm Standby startup process unconditionally hangs

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Bernd Helmle <mailings(at)oopsware(dot)de>
Cc: pgsql-bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: Warm Standby startup process unconditionally hangs
Date: 2011-03-23 20:37:11
Message-ID: AANLkTikWYDYwq50EwKiEDLkdw_kwQVpkY3ULiCsMY0vU@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Mon, Mar 21, 2011 at 4:44 PM, Bernd Helmle <mailings(at)oopsware(dot)de> wrote:
> PostgreSQL is 8.4.7, WAL-Logshipping is actually using SkyTools 2.1.12 (both
> installed from PGDG rpms), running with CentOS 5.5 XEN-VM with Kernel
> 2.6.18-194.26.1.el5xen.
>
> recovery.conf on the standby installed by walmgr is
>
> restore_command = '/usr/bin/walmgr.py
> /media/pgarchive/wal-config/wal-slave.ini xrestore %f "%p" %r'
> #recovery_target_time=
> #recovery_target_xid=
> #recovery_target_inclusive=true
> #recovery_target_timeline=
>
> The archive is located on a NFS share. We have seen the following happening
> on a PostgreSQL Warm Standby:
>
> Once for a while the Startup Process just stops recovering from archive. No
> more xlogs replayed, no more activity in the logs from the startup process.
> We've seen this two or three times in the past, this time we had the chance
> to do some further investigation what's happening.
>
> strace shows the startup process hanging with
>
> strace -p 31898
> Process 31898 attached - interrupt to quit
> semop(425990, 0x7fff19919060, 1
>
> Here is a backtrace from the startup process:
>
> (gdb) bt
> #0  0x0000003e214d5337 in semop () from /lib64/libc.so.6
> #1  0x00000000005a7ff3 in PGSemaphoreLock (sema=0x2b35a0140ce8,
> interruptOK=1 '\001') at pg_sema.c:420
> #2  0x00000000005c68a5 in LockBufferForCleanup (buffer=78428) at
> bufmgr.c:2467
> #3  0x0000000000485446 in RestoreBkpBlocks (lsn=<value optimized out>,
> record=0xcf2b830, cleanup=1 '\001') at xlog.c:3292
> #4  0x0000000000460ce0 in heap2_redo (lsn=..., record=0x7fff19919060) at
> heapam.c:4805
> #5  0x0000000000482eff in StartupXLOG () at xlog.c:5665
> #6  0x0000000000484cfb in StartupProcessMain () at xlog.c:8102
> #7  0x000000000049b5a8 in AuxiliaryProcessMain (argc=2, argv=<value
> optimized out>) at bootstrap.c:419
> #8  0x00000000005b0ef9 in StartChildProcess (type=StartupProcess) at
> postmaster.c:4319
> #9  0x00000000005b30c7 in PostmasterMain (argc=5, argv=0xcf079e0) at
> postmaster.c:1036
> #10 0x000000000056079e in main (argc=5, argv=<value optimized out>) at
> main.c:188
>
> Looking at the code, it seems it is stuck in waiting for UnpinBuffer(),
> ProcWaitForSignal(). Getting the backtrace for the bgwriter process, it
> seems it is working as expected:
>
> (gdb)
> #0  0x0000003e214cd1c3 in __select_nocancel () from /lib64/libc.so.6
> #1  0x00000000006b086a in pg_usleep (microsec=<value optimized out>) at
> pgsleep.c:43
> #2  0x00000000005ac00e in BackgroundWriterMain () at bgwriter.c:538
> #3  0x000000000049b5b7 in AuxiliaryProcessMain (argc=2, argv=<value
> optimized out>) at bootstrap.c:424
> #4  0x00000000005b0ef9 in StartChildProcess (type=BgWriterProcess) at
> postmaster.c:4319
> #5  0x00000000005b4df6 in sigusr1_handler (postgres_signal_arg=<value
> optimized out>) at postmaster.c:4095
> #6  <signal handler called>
> #7  0x0000003e214cd1c3 in __select_nocancel () from /lib64/libc.so.6
> #8  0x00000000005b1b5e in ServerLoop () at postmaster.c:1347
> #9  0x00000000005b30dc in PostmasterMain (argc=5, argv=0xcf079e0) at
> postmaster.c:1040
> #10 0x000000000056079e in main (argc=5, argv=<value optimized out>) at
> main.c:188
>
> The logfile shows the following sequence before this issue:
>
> <@ 2011-03-21 13:56:46 CET 4d7a3ae1.7c9a-1457> LOG:  restored log file
> "000000010000002000000070" from archive
> <@ 2011-03-21 13:56:46 CET 4d7a3ae2.7c9e-48794> DEBUG:  updated min recovery
> point to 20/704A6EC8
> <@ 2011-03-21 13:56:46 CET 4d7a3ae2.7c9e-48795> CONTEXT:  writing block 157
> of relation base/70631854/70632521
> <@ 2011-03-21 13:56:46 CET 4d7a3ae2.7c9e-48796> DEBUG:  updated min recovery
> point to 20/7054BD78
> <@ 2011-03-21 13:56:46 CET 4d7a3ae2.7c9e-48797> CONTEXT:  writing block 156
> of relation base/70631854/70632521
> <@ 2011-03-21 13:57:03 CET 4d7a3ae2.7c9e-48798> LOG:  restartpoint complete:
> wrote 4518 buffers (1.7%); write=225.415 s, sync=0.203 s, total=225.621 s
> <@ 2011-03-21 13:57:03 CET 4d7a3ae2.7c9e-48799> LOG:  recovery restart point
> at 20/68244EC8
> <@ 2011-03-21 13:58:18 CET 4d7a3ae2.7c9e-48800> LOG:  restartpoint starting:
> time
> <@ 2011-03-21 14:02:18 CET 4d7a3ae2.7c9e-48801> LOG:  restartpoint complete:
> wrote 3733 buffers (1.4%); write=239.997 s, sync=0.022 s, total=240.022 s
> <@ 2011-03-21 14:02:18 CET 4d7a3ae2.7c9e-48802> LOG:  recovery restart point
> at 20/6BEB2EB0
> <@ 2011-03-21 14:03:18 CET 4d7a3ae2.7c9e-48803> DEBUG:  skipping
> restartpoint, already performed at 20/6BEB2EB0
> <@ 2011-03-21 14:03:33 CET 4d7a3ae2.7c9e-48804> DEBUG:  skipping
> restartpoint, already performed at 20/6BEB2EB0
> <@ 2011-03-21 14:03:48 CET 4d7a3ae2.7c9e-48805> DEBUG:  skipping
> restartpoint, already performed at 20/6BEB2EB0
>
> After this, the startup process just "hangs", with the symptoms shown above.
> I need to restart the standby now, because the customer wants to have it
> back as soon as possible, but if you need more infos, please let me know.

Bernd, I just realised that my email to you was private, not public.
Copying whole message again here.

The answer looks clear from here. What I'm stunned about is that we've
not had a report about this before now. (You replied there had been one).

ProcSendSignal() doesn't do anything at all when called with the
Startup proc's pid, because BackendPidGetProc() returns NULL.

The root cause appears to be the separation of the two parts of the
Hot Standby patch that occurred late in 8.4 cycle. Looks like the
protection intended to allow the bgwriter to coexist with the startup
process was never applied.

Similar issue exists in 9.0 also when HS not enabled. (Fixed)

The patches are slightly different because the infrastructure isn't
all there in 8.4.

Please can you test these before I commit to 8.4.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
publish_startup_info_84.v1.patch application/octet-stream 3.1 KB

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Alvaro Herrera 2011-03-23 21:33:27 Re: Warm Standby startup process unconditionally hangs
Previous Message Susanne Ebrecht 2011-03-23 19:33:44 Re: TO_CHAR(timestamptz,datetimeformat) wrong after DST change