Re: DB/clog corruption

From: "Reid Thompson" <Reid(dot)Thompson(at)ateb(dot)com>
To: <pgsql-general(at)postgresql(dot)org>
Subject: Re: DB/clog corruption
Date: 2005-07-14 15:53:54
Message-ID: F71C0DC6B4FD3648815AAA7F969E352901530B20@sr002-2kexc.ateb.com
Lists: pgsql-general

Tom Lane wrote:
> "Reid Thompson" writes:
>> $ pg_dump -f table_dump.sql -t table_with_error dbname
>> FATAL 2: read of clog file 1559, offset 0 failed: Success
>
> You've got a clobbered tuple header in that file (ridiculous
> xmin or xmax value). Alternatively, given the evident age of
> this server, it could be a clobbered page header leading to
> the system following a bogus pointer to a tuple header.
>
> I think the track you want to pursue is identifying which
> page contains the bad data and looking at it with pg_filedump
> to see if there's anything recoverable there at all. If not,
> just zeroing out the page is probably the easiest way of
> getting to a dumpable state. You've probably lost this
> particular tuple in any case, you might or might not have lost the
> whole page.
>
> Once you get out of this, you might want to think about
> updating to something newer than 7.2 ...
>
> regards, tom lane

DB corruption resolution.

PG_VERSION: 7.2.3

First, I would very much like to thank Tom Lane for guiding me through
resolving this.

Issues: A procedural error resulted in a full filesystem (out of disk
space), which in turn made it impossible to create pg_clog file(s).
There was also a corrupted page in a single table.

Symptoms: pg_dump failure, "select * from" query failure, vacuum
failure; an all-around "Oh crap".

Disclaimer: If you are having a DB issue, do NOT assume that the steps
outlined below will resolve it for you. The mailing list archives note
that there is no single 'one procedure' for recovery; most incidents are
unique in some way. Many incidents may require certain 'same steps' to
be performed, but there is no cookie-cutter resolution. Search the
archives, read the threads regarding incidents that are similar to or
match yours, and request help from the mailing list if you feel it is
warranted. This synopsis is meant, hopefully, to add to the
understanding one gets when searching the archives for the issues it
addresses.

Initial symptom noted:

pg_dump: FATAL 2: open of /postgres/data/pg_clog/0202 failed: No such
file or directory
pg_dump: lost synchronization with server, resetting connection
pg_dump: SQL command to dump the contents of table "table_name_here"
failed: PQendcopy() failed.
pg_dump: Error message from server: FATAL 1: The database system is
starting up

pg_dump: The command was: COPY "table_name_here" TO stdout;
ERROR: pg_dump "dbname"
ERROR: backup failed

Before performing any work, I used pg_dump to individually dump all
other tables in the database (a quick loop like the sketch below).
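A minimal sketch of that per-table dump step; tables.txt (a list of the
unaffected table names) is my own placeholder, not part of the original
procedure:

# tables.txt is a placeholder file listing the unaffected table names
while read t; do
    pg_dump -f "${t}.sql" -t "$t" dbname || echo "dump of $t failed"
done < tables.txt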

First steps were to address the pg_clog file issues.
Note: PG versions 7.3 and above should not require manual intervention
for pg_clog files after an out-of-space condition.
Procedures outlined in the mailing list archives cover this.
Querying the mailing list archives for "dd bs= pg_clog", etc., should
provide ample reading material on extending the file in bs-sized
increments appropriate for your situation. My pg instance used the
default 8k block size... so,
shutdown the db
dd bs=8k count=1 < /dev/zero >>/postgres/data/pg_clog/0202
startup the db
... Note that the final size of your pg_clog file may need to be
something other than 8k; see the various threads in the archives related
to this issue.
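If you want a rough idea of what a given clog segment covers, the
back-of-the-envelope sketch below reflects my understanding of the clog
layout (2 status bits per transaction, 8k pages, 32 pages per 256k
segment file, segment files named with the segment number in hex); treat
it as an assumption and verify against the archives:

# based on my reading of the clog layout, not spelled out in the threads
SEGMENT=0x0202                      # the segment file the server tried to open
XIDS_PER_SEGMENT=$(( 32 * 32768 ))  # 1048576 transactions per segment file
echo "first xid covered by this segment: $(( SEGMENT * XIDS_PER_SEGMENT ))"
# prints roughly 539 million, far beyond the "next transaction id: 1237894"
# shown in the startup log below, which fits Tom's diagnosis of a garbage
# xmin/xmax in a clobbered tuple header.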

Getting past this yielded a different error:

$ pg_dump -f table_dump.sql -t table_with_error dbname
FATAL 2: read of clog file 1559, offset 0 failed: Success
pg_dump: FATAL 2: read of clog file 1559, offset 0 failed: Success
pg_dump: lost synchronization with server, resetting connection
DEBUG: server process (pid 1672) exited with exit code 2
DEBUG: terminating any other active server processes
DEBUG: all server processes terminated; reinitializing shared memory
and semaphores
DEBUG: database system was interrupted at 2005-07-11 08:41:39 EDT
DEBUG: checkpoint record is at 1/669E0528
DEBUG: redo record is at 1/669E0528; undo record is at 0/0; shutdown
TRUE
DEBUG: next transaction id: 1237894; next oid: 2850421
DEBUG: database system was not properly shut down; automatic recovery
in progress FATAL 1: The database system is starting up
pg_dump: SQL command to dump the contents of table "table_with_error"
failed: PQendcopy() failed.
pg_dump: Error message from server: FATAL 1: The database system is
starting up

pg_dump: The command was: COPY "table_with_error" TO stdout;
$ DEBUG: ReadRecord: record with zero length at 1/669E0568
DEBUG: redo is not required
DEBUG: database system is ready

Research in the archives, followed by a post to the mailing list and the
response received, led to using oid2name and pg_filedump to work toward
a resolution. oid2name came from the contrib source for my pg version.
It was used to get the pg filename for the affected table (162239), and
then find was used to determine where the table file resided
( find /postgres/data -name 162239 ).
After oid2name and find provided /postgres/data/base/1153088/162239 for
the affected table, pg_filedump was used to find the cause of the error.

As an alternative to having to find/install oid2name and use as above,
see http://www.postgresql.org/docs/8.0/static/storage.html for a
description of the database file layout.
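If building oid2name is inconvenient, the same numbers can also be
pulled straight from the system catalogs; a minimal sketch (the database
and table names are placeholders for your own):

# 'dbname' and 'table_with_error' are placeholders
psql -d dbname -c "select relfilenode from pg_class where relname = 'table_with_error'"
psql -d dbname -c "select oid from pg_database where datname = 'dbname'"
# the table file then lives at $PGDATA/base/<database oid>/<relfilenode>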

Note that pg_filedump is somewhat version-specific, depending on your pg
version. An archive search should provide links for download
( I used http://sources.redhat.com/rhdb/utilities.html to get the source
). The same search should provide some insight into how to use
pg_filedump for error determination ( the README included gives example
usage also -- read it ).
I shut down postgresql, copied the table file to a work area, and used
pg_filedump -i -f to generate output.
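For reference, that copy-and-inspect step amounted to something like the
sketch below (the work-area path is a placeholder; adjust the pg_ctl
invocation to however you normally stop the server):

pg_ctl stop -D /postgres/data
cp /postgres/data/base/1153088/162239 /some/work/area/   # placeholder work area
pg_filedump -i -f /some/work/area/162239 > 162239.dump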

Investigating through the output from pg_filedump on my affected table
yielded the following:

*****************************************************************
* PostgreSQL File/Block Formatted Dump Utility
*
* File: 162239
* Options used: -i -f
*
* Dump created on: Tue Jul 12 14:13:44 2005
*****************************************************************

Block 0 ******************************************************
.... info & data
Block 1
.... info & data
Block N
....
Block 347 ******************************************************
<Header> -----
Block Offset: 0x002b6000 Offsets: Lower 62121 (0xf2a9)
Block Size: 16732 Upper 16732 (0x415c)
LSN: logid 1720768 recoff 0x00001000 Special 62121 (0xf2a9)
Items: 15525 Free Space: 4294921907
Length (including item array): 8192

Error: Invalid header information.

Error: End of block encountered within the header. Bytes read: 8192.

0000: c0411a00 00100000 040f5d41 a9f25c41 .A........]A..\A
0010: a9f25c41 00000000 1a000200 08000000 ..\A............
0020: 00000000 00000000 fe010100 00000000 ................
0030: 00000000 00000000 00000000 00000000 ................
0040: 00000000 00000000 00000000 00000000 ................
0050: 00000000 00000000 00000000 00000000 ................
0060: 00000000 f3cff538 00000000 00000000 .......8........
0070: 00000000 00000000 00000000 00000000 ................
0080: 80811a00 00e00300 a1f25c41 598f823f ..........\AY..?
0090: 598f823f 00000000 1a000100 f8010000 Y..?............
...
lots more 'junk' like the above, followed by what appeared to be several
valid records, the end of the Block, and the start of the next Block.
Another search of the archives and a response from Tom confirmed that I
had a corrupt page in the table. The next step was to zero out the page.
The archives describe examples of doing this, and Tom's response pretty
explicitly stated what I needed to do:

dd conv=notrunc bs=8k seek=347 count=1 if=/dev/zero of=datafile
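As a sanity check (my own addition): seek=347 with bs=8k means the
zeroing starts at byte 347 * 8192 = 2842624 = 0x2B6000, which matches
the "Block Offset: 0x002b6000" pg_filedump printed for Block 347 above.

printf '%d = 0x%X\n' $(( 347 * 8192 )) $(( 347 * 8192 ))
# 2842624 = 0x2B6000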

Given that my datafile is /postgres/data/base/1153088/162239 (provided
by the earlier usage of oid2name and find):

cp /postgres/data/base/1153088/162239 /to/a/safe/place/in/case/i/screw/up/so/i/can/recover

dd conv=notrunc bs=8k seek=347 count=1 if=/dev/zero of=/postgres/data/base/1153088/162239

I then ran "pg_filedump -i -f /postgres/data/base/1153088/162239" and
captured the output again. Block 347 now looks like this.

Block 347 ******************************************************
<Header> -----
Block Offset: 0x002b6000 Offsets: Lower 0 (0x0000)
Block Size: 0 Upper 0 (0x0000)
LSN: logid 0 recoff 0x00000000 Special 0 (0x0000)
Items: -5 Free Space: 0
Length (including item array): 24

Error: Invalid header information.

0000: 00000000 00000000 00000000 00000000 ................
0010: 00000000 00000000 ........

<Data> ------
Error: Item index corrupt on block. Offset: <-5>.

<Special Section> -----
Error: Invalid special section encountered.
Error: Special section points off page. Unable to dump contents.

I then restarted postgresql and successfully pg_dump'd the affected
table. I then pg_dump'd the entire db, and reloaded the db from the
dump.
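For completeness, the final dump/reload was nothing fancy; a minimal
sketch, assuming a plain SQL dump and that the database can simply be
dropped and recreated (file and database names are placeholders):

# dbname.sql and dbname are placeholders for your own names
pg_dump -f dbname.sql dbname
dropdb dbname
createdb dbname
psql -d dbname -f dbname.sql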
