Re: An example of bugs for Hot Standby

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Hiroyuki Yamada <yamada(at)kokolink(dot)net>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: An example of bugs for Hot Standby
Date: 2009-12-17 22:54:16
Message-ID: 1261090456.634.4975.camel@ebony
Lists: pgsql-hackers

On Wed, 2009-12-16 at 14:05 +0000, Simon Riggs wrote:
> On Wed, 2009-12-16 at 10:33 +0000, Simon Riggs wrote:
> > On Tue, 2009-12-15 at 20:25 +0900, Hiroyuki Yamada wrote:
> > > Hot Standby node can freeze when startup process calls LockBufferForCleanup().
> > > This bug can be reproduced by the following procedure.
> >
> > Interesting. Looks like this can happen, which is a shame cos I just
> > removed the wait checking code after not ever having seen a wait.
> >
> > Thanks for the report.
> >
> > Must-fix item for HS.
>
> So this deadlock can happen at two places:
>
> 1. When a relation lock waits behind an AccessExclusiveLock and then
> Startup runs LockBufferForCleanup()
>
> 2. When Startup is a pin count waiter and a lock acquire begins to wait
> on a relation lock
>
> So we must put in direct deadlock detection in both places. We can't use
> the normal deadlock detector because in case (1) the backend might
> already have exceeded deadlock_timeout.
>
> Proposal:

Better proposal:

* It's possible for 3-way deadlocks to occur in Hot Standby mode.
* If a user backend sleeps on a lock while it holds a buffer pin, that
* leaves open the risk of deadlock. The user backend will only sleep
* if it waits behind an AccessExclusiveLock held by the Startup process.
* If the Startup process then tries to access any buffer that is pinned,
* it too will sleep and neither process will ever wake.
*
* We need to make a deadlock check in two places: in the user backend
* when we sleep on a lock, and in the Startup process when we sleep
* on a buffer pin. We need both checks because the deadlock can occur
* from both directions.
*
* Just before a user backend sleeps on a lock, we accumulate a list of
* buffers pinned by the backend. We then grab an LWLock
* and check each of the buffers to see if the Startup process is
* waiting on them. If so, we release the lock and throw a deadlock error.
* If the Startup process is not waiting, we record the pinned buffers
* in the BufferDeadlockRisk data structure and release the lock.
* When we later get the lock we remove the deadlock risk.
*
* When the Startup process is about to wait on a buffer pin, it looks up
* the buffer it is about to wait on in the BufferDeadlockRisk list. If the
* buffer is already held by one or more lock waiters then we send a
* conflict cancel to them and wait for them to die before rechecking
* the buffer lock.
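
To make the two checks concrete, here is a rough, single-threaded sketch in
plain C. It is not PostgreSQL code and not taken from any patch: only the
name BufferDeadlockRisk comes from the comment above. The struct layout, the
function names, the fixed capacity and the integer backend/buffer ids are all
assumptions made for illustration, and the places where the LWLock would be
taken and released are only marked with comments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_RISKS 64     /* hypothetical fixed capacity */
#define NO_BUFFER (-1)   /* Startup is not waiting on any buffer */

/* One (lock-waiting backend, pinned buffer) pair. Hypothetical layout. */
typedef struct {
    int backend_id;
    int buffer_id;
} RiskEntry;

typedef struct {
    int       startup_waiting_on;   /* buffer Startup waits on, or NO_BUFFER */
    RiskEntry entries[MAX_RISKS];
    size_t    nentries;
} BufferDeadlockRisk;

void risk_init(BufferDeadlockRisk *r)
{
    r->startup_waiting_on = NO_BUFFER;
    r->nentries = 0;
}

/*
 * User backend, just before sleeping on a lock. Returns false if the
 * Startup process is already waiting on one of our pinned buffers (the
 * caller would then throw a deadlock error); otherwise records the pins
 * as deadlock risks and returns true.
 */
bool backend_before_lock_sleep(BufferDeadlockRisk *r, int backend_id,
                               const int *pinned, size_t npinned)
{
    /* the LWLock would be acquired here */
    for (size_t i = 0; i < npinned; i++) {
        if (pinned[i] == r->startup_waiting_on)
            return false;            /* release LWLock, report deadlock */
    }
    assert(r->nentries + npinned <= MAX_RISKS);
    for (size_t i = 0; i < npinned; i++) {
        r->entries[r->nentries].backend_id = backend_id;
        r->entries[r->nentries].buffer_id = pinned[i];
        r->nentries++;
    }
    /* the LWLock would be released here */
    return true;
}

/* User backend, once it finally gets the lock: remove its risk entries. */
void backend_after_lock_acquired(BufferDeadlockRisk *r, int backend_id)
{
    size_t kept = 0;
    for (size_t i = 0; i < r->nentries; i++) {
        if (r->entries[i].backend_id != backend_id)
            r->entries[kept++] = r->entries[i];
    }
    r->nentries = kept;
}

/*
 * Startup process, just before waiting on a buffer pin. Marks the buffer
 * as waited-on and collects the distinct lock waiters that pin it; the
 * caller would send each a conflict cancel and then recheck the buffer.
 * Returns the number of backend ids written to victims[].
 */
size_t startup_before_pin_wait(BufferDeadlockRisk *r, int buffer_id,
                               int *victims, size_t max_victims)
{
    size_t n = 0;
    /* the LWLock would be acquired here */
    r->startup_waiting_on = buffer_id;
    for (size_t i = 0; i < r->nentries; i++) {
        bool seen = false;
        if (r->entries[i].buffer_id != buffer_id)
            continue;
        for (size_t j = 0; j < n; j++) {
            if (victims[j] == r->entries[i].backend_id)
                seen = true;
        }
        if (!seen && n < max_victims)
            victims[n++] = r->entries[i].backend_id;
    }
    /* the LWLock would be released here */
    return n;
}

/* Startup process, once it has obtained the cleanup lock. */
void startup_after_pin_wait(BufferDeadlockRisk *r)
{
    r->startup_waiting_on = NO_BUFFER;
}
```

In this sketch each side checks the other's state under the same lock, so
whichever process arrives second is the one that detects the deadlock: a
backend arriving second gets false back and raises the error itself, while
Startup arriving second gets the list of backends to cancel.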

This way we only cancel direct deadlocks.

It doesn't solve the general problem of buffer waits, but they may be
solvable by a different mechanism.

--
Simon Riggs www.2ndQuadrant.com
