Re: Hanging backends and possible index corruption

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Bernd Helmle <mailings(at)oopsware(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hanging backends and possible index corruption
Date: 2013-01-25 15:28:16
Message-ID: 20130125152816.GH4289@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2013-01-25 16:24:52 +0100, Bernd Helmle wrote:
> We are currently analyzing an issue at one of our customers PostgreSQL
> database.
>
> The current version is 9.1.6 (update to 9.1.7 is scheduled for next monday,
> no downtime possible before). It runs on POWER7 (pSeries 740) on an RHEL6.3
> 64-bit LPAR. The packages are built from PGDG SVN sources, no special tweaks
> added. We saw no hardware related errors on this machine, nor any crashes.
>
> What currently happens on this machine are hanging statements (SELECTs and
> INSERTs occasionally) with 100% CPU. After some investigation it turned out
> that the corresponding backends are seeking within an index file over and
> over again in a loop. Looking into the hanging queries i've recognized
> certain keys which seems to have the problem, other keys used in the WHERE
> condition run smoothly. Turning off index and bitmap index scans caused the
> suspicious keys to return results, too.
>
> So i've saved the index file (normal BTree index with a single bigint
> column), did a REINDEX and the problem was gone. Looking at the index file
> with pg_filedump and pgbtreecheck from Alvaro gave me the following output:
>
> pgbtreecheck gives warnings about pages' parents and then loops visiting the
> same pages over and over again:
>
> NOTICE: fast root: block 290 at level 2
> NOTICE: setting page parents
> WARNING: block 12709 already had a parent (8840); new parent is 12177
> WARNING: block 12710 already had a parent (12439); new parent is 10835
> NOTICE: done setting parents
> NOTICE: Checking forward scan of level 0, starting at block 1
>
> -- loop starts
>
> WARNING: right sibling 12710 does not point left to me (11680); points to
> 10924 instead
>
> Looking into the relevant pages and their prev and next pointers give me the
> following:
>
> pg_filedump -i -R 11680 ~/tmp/100252789 | grep Blocks
> Blocks: Previous (12651) Next (12710) Level (0) CycleId (0)
>
> pg_filedump -i -R 10924 ~/tmp/100252789 | grep Blocks
> Blocks: Previous (10923) Next (12710) Level (0) CycleId (0)
>
> pg_filedump -i -R 12710 ~/tmp/100252789 | grep Blocks
> Blocks: Previous (10924) Next (10925) Level (0) CycleId (0)
>
> pg_filedump -i -R 10925 ~/tmp/100252789 | grep Blocks
> Blocks: Previous (12710) Next (10926) Level (0) CycleId (0)
>
> pg_filedump -i -R 12709 ~/tmp/100252789 | grep Blocks
> Blocks: Previous (8849) Next (8850) Level (0) CycleId (0)
>
> pg_filedump -i -R 8840 ~/tmp/100252789 | grep Blocks
> Blocks: Previous (8555) Next (9125) Level (1) CycleId (0)
>
> pg_filedump -i -R 12439 ~/tmp/100252789 | grep Blocks
> Blocks: Previous (11405) Next (11690) Level (1) CycleId (0)
>
> $ pg_filedump -i -R 12177 ~/tmp/100252789 | grep Blocks
> Blocks: Previous (11690) Next (0) Level (1) CycleId (0)
>
> $ pg_filedump -i -R 10835 ~/tmp/100252789 | grep Blocks
> Blocks: Previous (10550) Next (11120) Level (1) CycleId (0)
>
> This symptom happened three times in the last couple of weeks now. Looking
> at the numbers doesn't give me the impression that some flaky hardware could
> be involved. What else can we do to track down this problem, any
> suggestions?

Did you reindex after upgrading to 9.1.6? Did you ever have any crashes
or failovers before upgrading to 9.1.6?
I have seen pretty similar symptoms caused by "Fix persistence marking
of shared buffers during WAL replay" in 9.1.6.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2013-01-25 15:29:15 Re: autovacuum not prioritising for-wraparound tables
Previous Message Heikki Linnakangas 2013-01-25 15:26:37 Re: My first patch! (to \df output)