Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: daveg <daveg(at)sonic(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held
Date: 2011-09-07 13:46:19
Message-ID: CA+TgmoYjEZwAUz4SrTtLj5MPLRdmwEwSZMgc7kAfv2q2=5r7_g@mail.gmail.com
Lists: pgsql-hackers

On Wed, Sep 7, 2011 at 5:16 AM, daveg <daveg(at)sonic(dot)net> wrote:
> On Tue, Aug 23, 2011 at 12:15:23PM -0400, Robert Haas wrote:
>> On Mon, Aug 22, 2011 at 3:31 AM, daveg <daveg(at)sonic(dot)net> wrote:
>> > So far I've got:
>> >
>> >  - affects system tables
>> >  - happens very soon after process startup
>> >  - in 8.4.7 and 9.0.4
>> >  - not likely to be hardware or OS related
>> >  - happens in clusters for periods of a few seconds to many minutes
>> >
>> > I'll work on printing the LOCK and LOCALLOCK when it happens, but it's
>> > hard to get downtime to pick up new builds. Any other ideas on getting to
>> > the bottom of this?
>>
>> I've been thinking this one over, and doing a little testing. I'm
>> still stumped, but I have a few thoughts.  What that error message is
>> really saying is that the LOCALLOCK bookkeeping doesn't match the
>> PROCLOCK bookkeeping; it doesn't tell us which one is to blame.
> ...
>> My second thought is that perhaps a process is occasionally managing
>> to exit without fully cleaning up the associated PROCLOCK entry.  At
>> first glance, it appears that this would explain the observed
>> symptoms.  A new backend gets the PGPROC belonging to the guy who
>> didn't clean up after himself, hits the error, and disconnects,
>> sticking himself right back on to the head of the SHM_QUEUE where the
>> next connection will inherit the same PGPROC and hit the same problem.
>>  But it's not clear to me what could cause the system to get into this
>> state in the first place, or how it would eventually right itself.
>>
>> It might be worth kludging up your system to add a test to
>> InitProcess() to verify that all of the myProcLocks SHM_QUEUEs are
>> either NULL or empty, along the lines of the attached patch (which
>> assumes that assertions are enabled; otherwise, put in an elog() of
>> some sort).  Actually, I wonder if we shouldn't move all the
>> SHMQueueInit() calls for myProcLocks to InitProcGlobal() rather than
>> doing it over again every time someone calls InitProcess().  Besides
>> being a waste of cycles, it's probably less robust this way.   If
>> there somehow are leftovers in one of those queues, the next
>> successful call to LockReleaseAll() ought to clean up the mess, but of
>> course there's no chance of that working if we've nuked the queue
>> pointers.
>
> I did this in the elog flavor, as we don't build production images with asserts.
> It has been running on all hosts for a few days. Today it hit the extra
> checks in InitProcess().
>
> 00:02:32.782  8626  [unknown] [unknown]  LOG:  connection received: host=bk0 port=42700
> 00:02:32.783  8627  [unknown] [unknown]  LOG:  connection received: host=op2 port=45876
> 00:02:32.783  8627  d61 apps  FATAL:  Initprocess myProclocks[4] not empty: queue 0x2ae6b4b895f8 (prev 0x2ae6b4a2b558, next 0x2ae6b4a2b558)
> 00:02:32.783  8626  d35 postgres  LOG:  connection authorized: user=postgres database=c35
> 00:02:32.783  21535  LOG:  server process (PID 8627) exited with exit code 1
> 00:02:32.783  21535  LOG:  terminating any other active server processes
> 00:02:32.783  8626  c35 postgres  WARNING:  terminating connection because of crash of another server process
>
> The patch that produced this is attached. If you can think of anything I
> can add to this to help I'd be happy to do so. Also, can I clean this up
> and continue somehow? Maybe clear the queue instead of having to restart?
> Or is there a way to just pause this proc here, maybe mark it not to be used
> and exit, or just to sleep forever so I can debug later?

I think what we really need to know here is: when the debugging code
you injected here fires, what happened to the previous owner of that
PGPROC that caused it to not clean up its state properly? The PGPROC
that 8627 inherited in the above example is obviously messed up - but
what did the last guy do that messed it up? It would be nice to log
the address of the PGPROC in every log message here - then you could
go back and look to see whether the last guy terminated in some
unusual way. In the meantime, could you look back a couple of
minutes from the time of the above occurrence and see if there are any
FATAL errors or unusual ERRORs?
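
For instance, something along these lines in InitProcess() (just an
untested sketch against the 9.0-era sources; adapt it to whatever your
patch already does) would tie each complaint to a particular PGPROC:

    /*
     * Sketch only: place in InitProcess() after MyProc is assigned, before
     * the existing loop that runs SHMQueueInit() over myProcLocks.  Each
     * queue should be either never-initialized (NULL) or empty; anything
     * else gets reported together with the PGPROC address, so the log can
     * be matched against whichever backend last used this slot.
     */
    {
        int         i;

        for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
        {
            SHM_QUEUE  *queue = &(MyProc->myProcLocks[i]);

            if (queue->prev != NULL && !SHMQueueEmpty(queue))
                elog(FATAL,
                     "InitProcess: PGPROC %p myProcLocks[%d] not empty: "
                     "queue %p (prev %p, next %p)",
                     MyProc, i, queue, queue->prev, queue->next);
        }
    }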

After spending some time staring at the code, I do have one idea as to
what might be going on here. When a backend is terminated,
ShutdownPostgres() calls AbortOutOfAnyTransaction() and then
LockReleaseAll(USER_LOCKMETHOD, true). The second call releases all
user locks, and the first one (or so we suppose) releases all normal
locks as part of aborting the transaction. But what if there's no
transaction in progress? In that case, AbortOutOfAnyTransaction()
won't do anything - which is fine as far as transaction-level locks
go, because we shouldn't be holding any of those anyway if there's no
transaction in progress. However, if we hold a session-level lock at
that point, then we'd orphan it. We don't make much use of session
locks. As far as I can see, they are used by (1) VACUUM, (2) CREATE
INDEX CONCURRENTLY, (3) ALTER DATABASE .. SET TABLESPACE, and (4) on
standby servers, redo of DROP DATABASE actions. Any chance one of
those died or was killed off around the time this happened?
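
In code terms, the callback in question (in src/backend/utils/init/postinit.c)
is roughly this; the comments are mine, sketching the scenario described
above:

    /*
     * Approximate shape of ShutdownPostgres().  The hypothesis: if no
     * transaction is in progress, the first call does nothing, so a
     * session-level lock held at that point is never released and is
     * inherited by the next backend that gets this PGPROC.
     */
    static void
    ShutdownPostgres(int code, Datum arg)
    {
        /* Make sure we've killed any active transaction */
        AbortOutOfAnyTransaction();

        /*
         * User locks are not released by transaction end, so be sure to
         * release them explicitly.
         */
        LockReleaseAll(USER_LOCKMETHOD, true);
    }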

One thing that might be worth doing in the interest of debugging is
adding code just like what you've already added, but in ShutdownPostgres(),
*after* the calls to AbortOutOfAnyTransaction() and LockReleaseAll(). That
might enable us to trap the problem at the time it happens, rather
than waiting until the next backend gets that PGPROC. Of course that
still might not answer the question of how it got like that, but at
least we'd be getting closer to the time of the actual event.
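
Concretely, something like this at the bottom of ShutdownPostgres(), after
the two calls shown above, ought to do it (again, only a rough, untested
sketch):

    /*
     * Sketch: same check as in InitProcess(), but run at backend shutdown.
     * By this point every myProcLocks queue should be empty; use LOG
     * rather than FATAL, since this backend is already on its way out.
     */
    {
        int         i;

        for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
        {
            SHM_QUEUE  *queue = &(MyProc->myProcLocks[i]);

            if (queue->prev != NULL && !SHMQueueEmpty(queue))
                elog(LOG,
                     "ShutdownPostgres: PGPROC %p myProcLocks[%d] not empty: "
                     "queue %p (prev %p, next %p)",
                     MyProc, i, queue, queue->prev, queue->next);
        }
    }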

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
