Corrupt Table

From: "Bryan White" <bryan(at)arcamax(dot)com>
To: "pgsql-general" <pgsql-general(at)postgreSQL(dot)org>
Subject: Corrupt Table
Date: 2000-09-14 15:50:47
Message-ID: 003b01c01e63$8ce09d20$2dd260d1@arcamax.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

I have apparently picked up a corrupt record in a table.

What happend:
Yesterday at one point the database seems to hang. There were three backend
processes consuming large amounts of CPU time. I stopped the server and
rebooted (3 months since last reboot). The database restarted and seemed to
be fine.

Then last night the nightly backups failed apparently when reading the
'customer' table. The database restarted itself. There have been a couple
of database restarts since then. As far as I can tell it is the customer
table that is the problem.

Here is what a failure looks like in the log file:
--------------------
Server process (pid 2864) exited with status 139 at Thu Sep 14 10:13:11 2000
Terminating any active server processes...
000914.10:13:11.425 [5879] NOTICE: Message from PostgreSQL backend:
The Postmaster has informed me that some other backend died
abnormally....
-------------------
The last entry is repeated multiple times.

I have written a small utility program (pganal). It looks for
inconsistancies in page layout and tuple layout. My original intent was to
parse the tuple internal structure as well but that proved to be more
complex that I was ready to handle at the time.

Anyway I stopped the database, copied the customer file to another directory
and restarted the database. Here is the pganal output from this copy:
--------------------------
Analyzing customer
Page 25878 ERROR: pd_lower is too small
pd_lower=0 pd_upper=0 pd_special=0 pd_opaque.od_pagesize=0
Page 31395 ERROR: pd_lower is too small
pd_lower=0 pd_upper=0 pd_special=0 pd_opaque.od_pagesize=0
Page 32950 ERROR: pd_lower is too small
pd_lower=0 pd_upper=0 pd_special=0 pd_opaque.od_pagesize=0
Tuple 71453.0 Ofs=8029 Len=164 Flags=1 Error: tuple overwrites pd_special
Tuple 71453.4 Ofs=7346 Len=208 Flags=1 Error: tuple overlaps another
Tuple 71453.40 Ofs=1365 Len=160 Flags=1 Error: tuple overlaps another
Page 71958 ERROR: pd_lower has odd value
pd_lower=11886 pd_upper=24239 pd_special=109 pd_opaque.od_pagesize=0
Page 73622 ERROR: pd_lower is too small
pd_lower=0 pd_upper=0 pd_special=0 pd_opaque.od_pagesize=0

Page Summary
Data Pages = 76555
Unused Pages = 0
New Pages = 0
Empty Pages = 0
Bad Pages = 5
Total Pages = 76560

Tuple Summary
O/L Error Tuples = 1
Overlaped Tuples = 2
Unused Tuples = 47994
Used Tuples = 3698495
Total Tuples = 3746492
--------------------------

I suspect the 'pd_lower is too small' may be just my misunderstanding of the
page layout.
The three tuple errors (all on the same page) and the 'pd_lower has odd
value' error seem to be real.
'pd_lower has odd value' comes from:
int nitems = (phd->pd_lower - sizeof(*phd)) / sizeof(ItemIdData);
if(nitems * sizeof(ItemIdData) != phd->pd_lower - sizeof(*phd))
pderr = "pd_lower has odd value";
Basically it means the pd_lower did not leave room for an integral number of
ItemIDData structures.

I seem to have two separate corrupt pages. I can post the full source to
pganal if anyone is interested. Its about 300 lines.

My question is how do I proceed from here. Going back to the previous day's
backup would be very painful in terms of lost data. I suspect the answer is
to perform surgery on the bad pages and then rebuild indexes but this is a
scary idea. Has anyone else created tools to deal with this kind of
problem?

Bryan White, ArcaMax.com, VP of Technology
You can't deny that it is not impossible, can you.

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Stephan Szabo 2000-09-14 16:02:19 Re: Show triggers in psql?
Previous Message Hernan 2000-09-14 15:23:28 testing...(ignore)