Re: error: could not find pg_class tuple for index 2662

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: daveg <daveg(at)sonic(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: error: could not find pg_class tuple for index 2662
Date: 2011-07-29 15:17:30
Message-ID: CA+TgmoYpn8hdVZ9JK1j1hRCEuWddmyPmLD=BL7P17J-_USHcvw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jul 29, 2011 at 9:55 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> daveg <daveg(at)sonic(dot)net> writes:
>> On Thu, Jul 28, 2011 at 07:45:01PM -0400, Robert Haas wrote:
>>> Ah, OK, sorry.  Well, in 9.0, VACUUM FULL is basically CLUSTER, which
>>> means that a REINDEX is happening as part of the same operation.  In
>>> 9.0, there's no point in doing VACUUM FULL immediately followed by
>>> REINDEX.  My guess is that this is happening either right around the
>>> time the VACUUM FULL commits or right around the time the REINDEX
>>> commits.  It'd be helpful to know which, if you can figure it out.
>
>> I'll update my vacuum script to skip reindexes after vacuum full for 9.0
>> servers and see if that makes the problem go away.
>
> The thing that was bizarre about the one instance in the buildfarm was
> that the error was persistent, ie, once a session had failed all its
> subsequent attempts to access pg_class failed too.  I gather from Dave's
> description that it's working that way for him too.  I can think of ways
> that there might be a transient race condition against a REINDEX, but
> it's very unclear why the failure would persist across multiple
> attempts.  The best idea I can come up with is that the session has
> somehow cached a wrong commit status for the reindexing transaction,
> causing it to believe that both old and new copies of the index's
> pg_class row are dead ... but how could that happen?  The underlying
> state in the catalog is not wrong, because no concurrent sessions are
> upset (at least not in the buildfarm case ... Dave, do you see more than
> one session doing this at a time?).

I was thinking more along the lines of a failure while processing a
sinval (shared cache invalidation) message emitted by the REINDEX. If
the sinval message doesn't get fully processed, we get confused about
what the relfilenode is for pg_class. If that happened for any other
relation, we could recover by scanning pg_class. But if it happens for
pg_class or pg_class_oid_index, we're toast, because we can't scan them
without knowing what relfilenode to open.

Now that can't be quite right, because of course those are mapped
relations... and I'm pretty sure there are some other holes in my line
of reasoning too. Just thinking out loud...
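
To make the circularity concrete, here is a toy Python model of the
argument above. It is only a sketch, not PostgreSQL code: the class
ToyBackend, its methods, the relfilenode numbers and the user table OID
16384 are all made up for illustration. The only real values are the
OIDs 1259 (pg_class) and 2662 (pg_class_oid_index). The use_relmap flag
stands in for the fact that mapped relations get their relfilenode from
a separate map file rather than from their own pg_class row.

```python
# Toy model only: every name here is an illustrative assumption,
# not a PostgreSQL internal.
PG_CLASS_OID = 1259          # pg_class
PG_CLASS_OID_INDEX = 2662    # pg_class_oid_index
USER_TABLE_OID = 16384       # hypothetical ordinary table


class ToyBackend:
    def __init__(self, use_relmap):
        # "Disk": relfilenode -> contents. The file behind pg_class
        # holds rows mapping relation OID -> relfilenode.
        self.disk = {
            5000: {PG_CLASS_OID: 5000,
                   PG_CLASS_OID_INDEX: 5001,
                   USER_TABLE_OID: 7000},
        }
        # Stand-in for a separate map file covering mapped relations.
        self.relmap = ({PG_CLASS_OID: 5000, PG_CLASS_OID_INDEX: 5001}
                       if use_relmap else {})
        # Session-local cache, seeded at startup for the bootstrap relations.
        self.cache = {PG_CLASS_OID: 5000, PG_CLASS_OID_INDEX: 5001}

    def invalidate(self, oid):
        # Models an invalidation that drops the cached entry.
        self.cache.pop(oid, None)

    def relfilenode(self, oid):
        if oid in self.cache:
            return self.cache[oid]
        if oid in self.relmap:
            # Mapped relation: answer comes from the map, no catalog scan.
            node = self.relmap[oid]
        elif oid in (PG_CLASS_OID, PG_CLASS_OID_INDEX):
            # Circular: finding this row means scanning pg_class, which
            # needs the very relfilenode we are trying to find.
            raise LookupError("cannot scan pg_class without its relfilenode")
        else:
            # Ordinary relation: scan pg_class, which requires knowing
            # where pg_class and its OID index live.
            self.relfilenode(PG_CLASS_OID_INDEX)
            rows = self.disk[self.relfilenode(PG_CLASS_OID)]
            node = rows[oid]
        self.cache[oid] = node
        return node
```

In this model, invalidating the entry for an ordinary table is harmless
because the next lookup rescans pg_class. Invalidating pg_class itself
with use_relmap=False makes every later uncached lookup fail, which is
the kind of persistent failure described upthread. With use_relmap=True
the same invalidation is recoverable, which is why the simple version of
this theory can't be the whole story.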

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
