Re: SR standby hangs

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SR standby hangs
Date: 2011-04-26 20:01:38
Message-ID: 4DB724A2.9040809@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 02/22/2011 07:55 AM, Robert Haas wrote:
> On Sun, Feb 20, 2011 at 12:39 PM, Greg Stark<gsstark(at)mit(dot)edu> wrote:
>> On Fri, Feb 18, 2011 at 6:59 PM, Andrew Dunstan<amdunstan(at)nc(dot)rr(dot)com> wrote:
>>> The server is running as a warm standby, and the client's application tries
>>> to connect to both the master and the slave, accepting whichever lets it
>>> connect (hence hot standby is not turned on).
>>> ...
>>> #2 0x00000000005de645 in LockBufferForCleanup () at bufmgr.c:2432
>>> #3 0x0000000000463733 in heap_xlog_clean (lsn=<value optimized out>,
>> Hm, if hot standby isn't turned on why are we bothering with locks at
>> all? But that said is there *anything* else happening in the database
>> aside from recovery? Are there any errors in the database log?
>>
>> But this still shouldn't block. It's either blocked locking the buffer
>> or blocked waiting for the buffer to become unpinned. It would be nice
>> to get a backtrace from a debugging build which wouldn't have some of
>> the functions inlined. It would be really nice to see the pin count on
>> the buffer in question -- perhaps it has gotten out of sync or
>> underflowed?
> I realize now that it would also be useful to see the state of the
> LWLock on the buffer.

This has happened again. This time we have some debug info available,
and can possible get more, if people tell me what will be helpful:

(gdb) bt
#0 0x0000003493ed5337 in semop () from /lib64/libc.so.6
#1 0x00000000005bd1e3 in PGSemaphoreLock (sema=0x2b1a6014ac38,
interruptOK=1 '\001') at pg_sema.c:420
#2 0x00000000005de735 in LockBufferForCleanup (buffer=310163) at
bufmgr.c:2432
#3 0x000000000048b896 in RestoreBkpBlocks (lsn=<value optimized
out>, record=0x2b1a6191b010, cleanup=1 '\001') at xlog.c:3540
#4 0x0000000000463788 in heap_xlog_clean (lsn=<value optimized
out>, record=0x2b1a6191b010) at heapam.c:4160
#5 heap2_redo (lsn=<value optimized out>, record=0x2b1a6191b010) at
heapam.c:4858
(gdb) f 2
#2 0x00000000005de735 in LockBufferForCleanup (buffer=310163) at
bufmgr.c:2432
2432 ProcWaitForSignal();
(gdb) p *bufHdr
$2 = {tag = {rnode = {spcNode = 16393, dbNode = 40475, relNode =
41880}, forkNum = MAIN_FORKNUM, blockNum = 18913}, flags = 6,
usage_count = 1, refcount = 1, wait_backend_pid = 9111,
buf_hdr_lock = 0 '\000', buf_id = 310162, freeNext = -2,
io_in_progress_lock = 620448, content_lock = 620449}

Strangely, I don't even see ProcWaitForSignal() in the frame list - I
shouldn't have thought it was a candidate to be optimised away.

cheers

andrew

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Merlin Moncure 2011-04-26 20:19:35 Re: Bad COMPACT_ALLOC_CHUNK size in tsearch/spell.c?
Previous Message Bruce Momjian 2011-04-26 19:53:32 Re: pg_ctl restart - behaviour based on wrong instance