ERROR: invalid page in block 1226710 of relation base/16750/27244

From: bricklen <bricklen(at)gmail(dot)com>
To: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: ERROR: invalid page in block 1226710 of relation base/16750/27244
Date: 2015-10-21 18:16:41
Message-ID: CAGrpgQ96pGEisQWRep75hJ6vo31-n1g_MHdPxBUK0P+WUY=T0Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi,

We have run into some corruption in one of our production tables. We know
the cause (a compute node was moved), but now we need to fix the data. We
have backups, but at this point they are nearly a day old, so recovering
from them is a last-resort and will incur significant downtime.
We are running 9.3.9

Following the steps at
http://blog.endpoint.com/2010/06/tracking-down-database-corruption-with.html

I get the following output for ctid, id, other_id, tstamp:

(690651,42) | 318698967 | 347978007 | 2015-10-20 01:55:41.757+00
(690651,43) | 318698968 | 347978008 | 2015-10-20 01:55:41.663+00
(690651,44) | 318698969 | 347978009 | 2015-10-20 01:55:42.005+00
ERROR: invalid page in block 1226710 of relation base/16750/27244

It appears 690652 is what would be dd'd if that's the route we take. Is
that accurate?

Because the message indicates the corruption is in the table's page, not
the page header, according to the docs zero_damaged_pages probably won't
work.

What are my options?

1). Enable zero_damaged_pages and execute VACUUM FREEZE (and hope).
2). dd the block(s) using the output of the ctid query above.

It is multi-gigabyte table that is extremely heavily used (100's to 1000's
of queries per second) so a VACUUM FULL or CLUSTER are options we'd really
like to avoid if possible. The database is about 250GB, not huge, but big
enough that slaves and backups are time consuming to redo, or recover from.

Will attempting zero_damaged_pages cause any harm as the first step (other
than the obvious destruction of any bad pages)?

Is this the correct command if option #2 is chosen? Can it be executed
against a running cluster?
dd if=/dev/zero of=database/16750/27244 bs=8192 seek=690652 count=1
conv=notrunc

Thanks,

Bricklen

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Jonathan Vanasco 2015-10-21 18:43:56 temporary indexes?
Previous Message Joe Conway 2015-10-21 18:00:02 Re: Configure with Openssl fails