From: | Scott Marlowe <smarlowe(at)g2switchworks(dot)com> |
---|---|
To: | "Ed L(dot)" <pgsql(at)bluepolka(dot)net> |
Cc: | pgsql-general(at)postgresql(dot)org |
Subject: | Re: db corruption/recovery help |
Date: | 2005-06-06 21:17:36 |
Message-ID: | 1118092656.5758.115.camel@state.g2switchworks.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
On Mon, 2005-06-06 at 15:16, Ed L. wrote:
> Someone flipped a breaker switch, and evidently triggered
> corruption in one of our major clusters:
>
> $ cat server_log.Mon
> postmaster successfully started
> 2005-06-06 14:31:11.950 [20124] LOG: database system was interrupted being in recovery at 2005-06-06 14:29:01 EDT
> This probably means that some data blocks are corrupted
> and you will have to use the last backup for recovery.
> 2005-06-06 14:31:11.950 [20124] LOG: checkpoint record is at EF/EBB7AFC8
> 2005-06-06 14:31:11.950 [20124] LOG: redo record is at EF/EBA91EF0; undo record is at 0/0; shutdown FALSE
> 2005-06-06 14:31:11.950 [20124] LOG: next transaction id: 577477594; next oid: 89750885
> 2005-06-06 14:31:11.951 [20124] LOG: database system was not properly shut down; automatic recovery in progress
> 2005-06-06 14:31:11.952 [20124] LOG: redo starts at EF/EBA91EF0
> 2005-06-06 14:31:11.984 [20124] PANIC: Invalid page header in block 22376 of 79189398
> 2005-06-06 14:31:12.275 [20121] LOG: startup process (pid 20124) was terminated by signal 6
> 2005-06-06 14:31:12.275 [20121] LOG: aborting startup due to startup process failure
>
> We have backups from 10 hours earlier, but the obvious
> question: what, if anything, can I do now to salvage those
> 10 hours of data from this?
>
> I guess I could zero the block? I'm a little uncertain since
> I don't know what I'm zeroing (pg_database? pg_class?), and
> can't start up to see what that relfilenode maps to...
>
> Going to look at it with pg_filedump, maybe oid2filename or
> whatever that utility is...
OK, if postgresql is running on hardware that doe NOT lie about
fsyncing, and it is set to fsync, this should NEVER happen.
The fact that it happened means either A: the hardware is lying / broken
(like default IDE drives or a RAID controller with a non-battery backed
cache) or B: the database was set to not fsync. Since the default IS to
fsync, this would require effort on your part to turn it off.
Now, once the database comes up like this, your data may or may not be
salvagable, and may or may not be coherent. I.e. you may have orphaned
records where you shouldn't, or duplicate entries in some unique index
or something like that.
Look for pg_resetxlog. There's a man page for it on my machine, but I'm
not sure about 7.3.x
Before doing anything, make a file system level backup of all your data
so you have a pristine set to play about with.
From | Date | Subject | |
---|---|---|---|
Next Message | Ed L. | 2005-06-06 21:29:24 | Re: db corruption/recovery help |
Previous Message | Ed L. | 2005-06-06 21:02:05 | Re: db corruption/recovery help |