Backends stalled in 'startup' state: index corruption

From: Greg Sabino Mullane <greg(at)endpoint(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Backends stalled in 'startup' state: index corruption
Date: 2012-05-24 17:33:25
Message-ID: 20120524173325.GA5533@tinybird.home
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Yesterday I had a client that experienced a sudden high load on
one of their servers (8.3.5 - yes, I know. Those of you with
clients will understand). When I checked, almost all connections
were in a "startup" state, very similar to this thread:

http://postgresql.1045698.n5.nabble.com/9-1-3-backends-getting-stuck-in-startup-td5670712.html

Running a strace showed a lot of semop activity, and the logs showed a
successful connection, then a 5 minute plus wait before a query was issued.
So obviously, blocking on something. Unlike the thread above, I *did* find
problems in the system catalogs. For example, both pg_class and pg_index
gave warnings like this for every index during a VACUUM FULL
VERBOSE tablename:

WARNING: index "pg_class_relname_nsp_index" contains 7712 row versions,
but table contains 9471 row versions
HINT: Rebuild the index with REINDEX.

A REINDEX did not solve the problem (initially), as a REINDEX followed
by a VAC showed the same warning and hint.

The next step was dropping to standalone mode, but before that could
be done, the REINDEXes fixed the problem (no warnings, no stalled
connections). So my questions are:

* Why would a REINDEX not fix the problem as the hint suggested?
Why would it then start "working"?

* What exactly is the relationship between bad indexes and shared
memory locks?

* Is there some other emergency fix when it happens, such as killing
all backends and hoping you kill the one that is actually holding
the lock (if any was).

* Did anything in the 8.3 series fix this? I saw nothing relevant in
the release notes for everything up to 8.3.18 (which it will be on soon).

--
Greg Sabino Mullane greg(at)endpoint(dot)com
End Point Corporation
PGP Key: 0x14964AC8

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alex Hunsaker 2012-05-24 17:36:20 Re: plperl_helpers.h fix for clang
Previous Message Florian Pflug 2012-05-24 17:31:37 Re: [RFC] Interface of Row Level Security