Re: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741
Date: 2012-05-31 13:51:25
Message-ID: 8799.1338472285@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> The one thing that still seems a little odd to me is that this caused
> a pin count to get orphaned. It seems reasonable that ignoring the
> AccessExclusiveLock could result in not-found errors trying to open a
> missing relation, and even fsync requests on a missing relation. But
> I don't see why that would cause the backend-local pin counts to get
> messed up, which makes me wonder if there really is another bug here
> somewhere.

According to Heikki's log, the Assert was in the startup process itself,
and it happened after an error:

> 2012-05-26 10:44:28.587 CEST 10270 FATAL: could not open file "base/21268/32994": No such file or directory
> 2012-05-26 10:44:28.588 CEST 10270 CONTEXT: writing block 2508 of relation base/21268/32994
> xlog redo multi-insert (init): rel 1663/21268/33006; blk 3117; 58 tuples
> TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1741)
> 2012-05-26 10:44:31.131 CEST 10269 LOG: startup process (PID 10270) was terminated by signal 6: Aborted

I don't think that code is meant to recover from errors anyway, so
the fact that it fails with a pin count held isn't exactly surprising.
But it might be worth looking at exactly which on_proc_exit callbacks
are installed in the startup process and what assumptions they make.

As for where the error came from in the first place, it's easy to
imagine somebody who's not got the word about the AccessExclusiveLock
reading pages of the table into buffers that have already been scanned
by the DROP. So you'd end up with orphaned buffers belonging to a
vanished table. If somebody managed to dirty them by setting hint bits
(we do allow that in HS mode no?) then later you'd have various processes
trying to write the buffer before recycling it, which seems to fit the
reported error.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2012-05-31 13:58:33 Re: Uh, I change my mind about commit_delay + commit_siblings (sort of)
Previous Message Robert Klemme 2012-05-31 13:38:35 Re: pg_dump and thousands of schemas