Re: error: could not find pg_class tuple for index 2662

From: daveg <daveg(at)sonic(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: error: could not find pg_class tuple for index 2662
Date: 2011-08-01 03:06:31
Message-ID: 20110801030630.GG15578@sonic.net
Lists: pgsql-hackers

On Sun, Jul 31, 2011 at 11:44:39AM -0400, Tom Lane wrote:
> daveg <daveg(at)sonic(dot)net> writes:
> > Here is the update: the problem happens with vacuum full alone, no reindex
> > is needed to trigger it. I updated the script to avoid reindexing after
> > vacuum. Over the past two days there are still many occurrences of this
> > error coincident with the vacuum.
>
> Well, that jibes with the assumption that the one case we saw in
> the buildfarm was the same thing, because the regression tests were
> certainly only doing a VACUUM FULL and not a REINDEX of pg_class.
> But it doesn't get us much closer to understanding what's happening.
> In particular, it seems to knock out most ideas associated with race
> conditions, because the VAC FULL should hold exclusive lock on pg_class
> until it's completely done (including index rebuilds).
>
> I think we need to start adding some instrumentation so we can get a
> better handle on what's going on in your database. If I were to send
> you a source-code patch for the server that adds some more logging
> printout when this happens, would you be willing/able to run a patched
> build on your machine?

Yes, we can run an instrumented server so long as the instrumentation does
not interfere with normal operation. However, scheduling downtime to switch
binaries is difficult and generally needs to happen on a weekend, though it
can sometimes be expedited. I'll look into that.

> (BTW, just to be perfectly clear ... the "could not find pg_class tuple"
> errors always mention index 2662, right, never any other number?)

Yes, only index 2662, never any other.
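
For anyone who wants to make the same check against their own logs, a minimal
Python sketch along these lines would tally the index OIDs named in the error.
The function name tally_index_oids and the log file name in the usage comment
are assumptions for illustration, not part of our scripts; the pattern is
simply the error text from the subject line.

```python
import re
import sys
from collections import Counter

# Error text as it appears in the subject of this thread; the OID is captured.
PATTERN = re.compile(r"could not find pg_class tuple for index (\d+)")


def tally_index_oids(lines):
    """Count how often each index OID is named in the error message.

    `lines` is any iterable of log lines (hypothetical helper, for illustration).
    """
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    # Assumed usage: python tally.py < postgresql.log
    for oid, n in sorted(tally_index_oids(sys.stdin).items()):
        print(f"index {oid}: {n} occurrence(s)")
```

If the output lists any OID other than 2662, that would contradict the answer
above and would be worth reporting.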

I'm attaching a somewhat redacted log for two different databases on the same
instance around the time of vacuum full of pg_class in each database.
My observations so far are:

- the error occurs at commit of vacuum full of pg_class
- in these cases the error hits autovacuum after it waited for a lock on
  pg_class
- in these two cases there was a new process startup while the vacuum was
  running. Don't know if this is relevant.
- while these hit autovacuum, the error does hit other processes (just not in
  these sessions). Unknown if autovacuum is a required component.

-dg

--
David Gould daveg(at)sonic(dot)net 510 536 1443 510 282 0869
If simplicity worked, the world would be overrun with insects.

Attachment           Content-Type  Size
transcript.0729.c01  text/plain    3.0 KB
transcript.0729.c57  text/plain    7.1 KB
