Re: Explained by known hardware failures, or keep looking?

From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: <pgsql-admin(at)postgresql(dot)org>
Subject: Re: Explained by known hardware failures, or keep looking?
Date: 2007-06-18 20:52:53
Message-ID: 1182199973.6855.279.camel@silverbirch.site
Lists: pgsql-admin

On Mon, 2007-06-18 at 14:41 -0500, Kevin Grittner wrote:

> [2007-06-14 11:31:05.986 CDT] 6781 LOG: redo starts at 1D2/6C739064
> [2007-06-14 11:31:46.533 CDT] 6781 WARNING: invalid page header in block 182566 of relation "1523860"; zeroing out page
> [2007-06-14 11:31:46.533 CDT] 6781 CONTEXT: xlog redo split_r: rel 1663/16386/1523860; tid 182566/92; oth 182563; rgh 115741
> [2007-06-14 11:31:56.228 CDT] 6781 WARNING: invalid page header in block 182567 of relation "1523860"; zeroing out page
> [2007-06-14 11:31:56.229 CDT] 6781 CONTEXT: xlog redo split_r: rel 1663/16386/1523860; tid 182567/94; oth 182128; rgh 114655
> [2007-06-14 11:32:04.964 CDT] 6781 WARNING: invalid page header in block 123644 of relation "1524189"; zeroing out page
> [2007-06-14 11:32:04.964 CDT] 6781 CONTEXT: xlog redo split_r: rel 1663/16386/1524189; tid 123644/101; oth 123634; rgh 106665
> [2007-06-14 11:32:11.327 CDT] 6781 WARNING: invalid page header in block 356562 of relation "1524219"; zeroing out page
> [2007-06-14 11:32:11.327 CDT] 6781 CONTEXT: xlog redo split_r: rel 1663/16386/1524219; tid 356562/58; oth 356549; rgh 34892
> [2007-06-14 11:32:14.795 CDT] 6781 LOG: record with zero length at 1D2/70C31890
> [2007-06-14 11:32:14.795 CDT] 6781 LOG: redo done at 1D2/70C31868
> [2007-06-14 11:32:33.833 CDT] 6781 LOG: database system is ready

I can believe that this was caused by blocks that were written to, but
not yet flushed at checkpoint. This could happen if the blocks were
reasonably heavily used, say as the right edge of an index for two
connected tables at the time of the crash. I've got no diagnostics to
back that up, however detailed the logs look. Other explanations welcome.

> Could all of this be reasonably explained by the controller failure and/or the subsequent abrupt power loss, or should I be looking for another cause? Personally, as I look at this, I'm suspicious that either the controller didn't persist dirty pages in the June 14th failure or there is some ongoing hardware problem.

Yes. The controller failure means data loss. PostgreSQL doesn't have a
disk check utility because your data is never at risk from us when
running with full transaction guarantees (ref new feature in 8.3), but
the disk failure has meant that data you thought was on disk really
wasn't.

So your DB has holes in it and you need to recover/failover/pull-hair.
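
To see where the holes are, the zeroed pages can be pulled out of the
recovery log. What follows is only a minimal sketch, assuming the WARNING
lines keep exactly the wording quoted above; the function name
zeroed_pages and the regular expression are illustrative, not anything
shipped with PostgreSQL.

```python
import re
from collections import defaultdict

# Matches the WARNING wording shown in the quoted log (assumed stable).
ZEROED_PAGE = re.compile(
    r'invalid page header in block (\d+) of relation "([^"]+)"; zeroing out page'
)


def zeroed_pages(log_lines):
    """Group zeroed block numbers by the relation named in the log line."""
    pages = defaultdict(list)
    for line in log_lines:
        match = ZEROED_PAGE.search(line)
        if match:
            block, relation = match.groups()
            pages[relation].append(int(block))
    return dict(pages)


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python zeroed_pages.py < postgresql.log
    for relation, blocks in sorted(zeroed_pages(sys.stdin).items()):
        print(relation, blocks)
```

The relation shown in these messages looks like the file node (it matches
the third number in "rel 1663/16386/1523860"), so, on that assumption, it
could be matched against pg_class.relfilenode in the affected database to
find the object's name.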

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
