Hanging backends and possible index corruption

From: Bernd Helmle <mailings(at)oopsware(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Hanging backends and possible index corruption
Date: 2013-01-25 15:24:52
Message-ID: 41AD83B5DEFB895033035BF1@apophis.credativ.lan
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

We are currently analyzing an issue at one of our customers PostgreSQL
database.

The current version is 9.1.6 (update to 9.1.7 is scheduled for next monday,
no downtime possible before). It runs on POWER7 (pSeries 740) on an RHEL6.3
64-bit LPAR. The packages are built from PGDG SVN sources, no special
tweaks added. We saw no hardware related errors on this machine, nor any
crashes.

What currently happens on this machine are hanging statements (SELECTs and
INSERTs occasionally) with 100% CPU. After some investigation it turned out
that the corresponding backends are seeking within an index file over and
over again in a loop. Looking into the hanging queries i've recognized
certain keys which seems to have the problem, other keys used in the WHERE
condition run smoothly. Turning off index and bitmap index scans caused the
suspicious keys to return results, too.

So i've saved the index file (normal BTree index with a single bigint
column), did a REINDEX and the problem was gone. Looking at the index file
with pg_filedump and pgbtreecheck from Alvaro gave me the following output:

pgbtreecheck gives warnings about pages' parents and then loops visiting
the same pages over and over again:

NOTICE: fast root: block 290 at level 2
NOTICE: setting page parents
WARNING: block 12709 already had a parent (8840); new parent is 12177
WARNING: block 12710 already had a parent (12439); new parent is 10835
NOTICE: done setting parents
NOTICE: Checking forward scan of level 0, starting at block 1

-- loop starts

WARNING: right sibling 12710 does not point left to me (11680); points to
10924 instead

Looking into the relevant pages and their prev and next pointers give me
the following:

pg_filedump -i -R 11680 ~/tmp/100252789 | grep Blocks
Blocks: Previous (12651) Next (12710) Level (0) CycleId (0)

pg_filedump -i -R 10924 ~/tmp/100252789 | grep Blocks
Blocks: Previous (10923) Next (12710) Level (0) CycleId (0)

pg_filedump -i -R 12710 ~/tmp/100252789 | grep Blocks
Blocks: Previous (10924) Next (10925) Level (0) CycleId (0)

pg_filedump -i -R 10925 ~/tmp/100252789 | grep Blocks
Blocks: Previous (12710) Next (10926) Level (0) CycleId (0)

pg_filedump -i -R 12709 ~/tmp/100252789 | grep Blocks
Blocks: Previous (8849) Next (8850) Level (0) CycleId (0)

pg_filedump -i -R 8840 ~/tmp/100252789 | grep Blocks
Blocks: Previous (8555) Next (9125) Level (1) CycleId (0)

pg_filedump -i -R 12439 ~/tmp/100252789 | grep Blocks
Blocks: Previous (11405) Next (11690) Level (1) CycleId (0)

$ pg_filedump -i -R 12177 ~/tmp/100252789 | grep Blocks
Blocks: Previous (11690) Next (0) Level (1) CycleId (0)

$ pg_filedump -i -R 10835 ~/tmp/100252789 | grep Blocks
Blocks: Previous (10550) Next (11120) Level (1) CycleId (0)

This symptom happened three times in the last couple of weeks now. Looking
at the numbers doesn't give me the impression that some flaky hardware
could be involved. What else can we do to track down this problem, any
suggestions?

--
Thanks

Bernd

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2013-01-25 15:26:10 Re: BUG #6510: A simple prompt is displayed using wrong charset
Previous Message Bruce Momjian 2013-01-25 15:24:33 Using COPY FREEZE with pg_restore --single-transaction