Autovacuum and invalid page header

From: Ireneusz Pluta <ipluta(at)wp(dot)pl>
To: "pgsql-admin(at)postgresql(dot)org" <pgsql-admin(at)postgresql(dot)org>
Subject: Autovacuum and invalid page header
Date: 2010-05-12 11:11:12
Message-ID: 4BEA8CD0.8070108@wp.pl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

I had a problem with neverending forced autovacuum process, running as
preventing xid wraparound. As I finally (?) found, following some
advices given here: ->
http://forums.enterprisedb.com/posts/list/2028.page, that autovacuum in
question was not just one autovacuum, but many different autovacuums on
the same table cycling in start-fail scenario because of invalid page
headers of some indexes of the table. Manual VACUUM VERBOSE said me
that, now I can also see that in related ERROR-CONTEXT log message
pairs. I dropped the damaged indexes and now that autovacuum seems to
continue without repeating itself.

Anyway, another questions come from that situation and I'd like to
discuss them here and get some advice.

First, it seems that an unnoticed damage of a relation, causing
autovacuum failures, which might initially be harmless for the whole
cluster (as far as only damaged relation and its disk space is
concerned), may lead to a critical situation when a wraparound gets
close, and autovacuum is in prevenitng mode can't freeze its xids. The
forced autovacuum insists on vacuuming the table which can not be
vacuumed and locks itself there. At the same time, autovacuum get
sticked with a database of the damaged relation, and "forgets" about
other databases which might need me vacuumed. From this point of view,
just one damaged table might be a single point of failure of the whole
cluster. Is there any way, other than tracing logs, to prevent such a
situation?

Another issue, rather loosely related to the topic. I suspect that the
cluster may have more invalid page headers like that. They might be
caused by past bad sector failures of one of the drives of my raid
array. So I think it might be a good idea to check the whole cluster,
page by page, for invalid page headers. Is there any ready tool, which,
for instance when given a path to database cluster, would traverse all
cluster directories and files and check all page headers? I probably
answered myself - manual VACUUM [VERBOSE] would do - but it fails when
finds the first invalid header, so I would have. That's why I am looking
for something only scanning and reporting invalid headers.

Yet another issue: how could that happen, the bad sectors of a RAID10
member drive caused page damages? (as far as I am right in this
assumption, of course, but I have a reason to think so). Should not the
RAID array prevent such a file damage?

Thanks

Irek.

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message Kevin Grittner 2010-05-12 11:31:15 Re: Connection reset by peer
Previous Message Silvio Brandani 2010-05-12 09:06:32 Connection reset by peer