Re: Trying to recover a corrupted database

From: John Scalia <jayknowsunix(at)gmail(dot)com>
To: Scott Whitney <swhitney(at)journyx(dot)com>, "pgsql-admin(at)postgresql(dot)org" <pgsql-admin(at)postgresql(dot)org>
Subject: Re: Trying to recover a corrupted database
Date: 2014-07-17 20:26:16
Message-ID: 53C83168.2060504@gmail.com
Lists: pgsql-admin

Well, nothing to lose, so I did a touch on pg_clog/0DCE, and at least the error changed. Now it says:

DETAIL: Could not read from file "pg_clog/0DCE" at offset 98304: Success.

But I don't understand why it says Success?
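
My guess is that the touch only created a zero-length file, so the read of the 8K page at offset 98304 came back empty; a short read leaves errno at 0, which gets printed as "Success". If I end up going the last-resort route from the KB article Scott linked below, I assume the next step would be to pad the segment out to its full 256 kB with zeroes, with the server stopped and a copy of the data directory set aside first. Something like:

   # sketch only -- pad the empty clog segment to its full 256 kB size
   # (zero bytes mark every transaction in the segment as "in progress",
   # so rows from those transactions should simply look uncommitted)
   dd if=/dev/zero of=pg_clog/0DCE bs=256k count=1

I'd appreciate a sanity check from the list before I try that on production, though.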

On 7/17/2014 4:09 PM, Scott Whitney wrote:
> Oohhh... IIRC, I created the clog file as a zero-byte file to get around that. It was _absolutely_ a step of last resort. I would make certain you have a backup that can get you back at least to this point before proceeding.
>
> http://kb.parallels.com/en/6067
>
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> I was attempting to do that with one table that I know is broken, but I can't get anything from it in a select. All I get no matter what is:
>
> ERROR: could not access status of transaction 3706060803
> DETAIL: Could not open file "pg_clog/0DCE": No such file or directory
>
> I think the pg_clog files track the commit status of each transaction, which is how the DB decides whether a record is or isn't visible in the table, but I'm not certain. I just wish there was a pg_resetclog executable like there is for the pg_xlog segments.
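>
> If I have the clog layout right, each segment covers 1,048,576 transactions (2 status bits per transaction, 8192-byte pages, 32 pages per segment), so the segment name and page offset for a given xid can be worked out with a bit of shell arithmetic. A rough sketch:
>
>    # which clog segment/page does transaction 3706060803 live in?
>    xid=3706060803
>    printf 'segment %04X, page offset %d\n' \
>        $(( xid / 1048576 )) \
>        $(( xid % 1048576 / 32768 * 8192 ))
>    # prints: segment 0DCE, page offset 98304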
>
> On 7/17/2014 3:58 PM, Scott Whitney wrote:
>
> Yeah. I know the feeling, man. What I ended up doing was "select * from" until I found the broken one then jumped down into groups of thousands of records and basically
> "Newtoned" it down until I found the affected data.
>
> It sucked. A lot.
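>
> Something along these lines, roughly (a sketch from memory -- "mydb", "mytable", and the row counts are stand-ins):
>
>    # run the same SELECT over successively smaller windows; the window
>    # that still errors out contains the bad rows, so halve it and repeat
>    psql -d mydb -c "SELECT * FROM mytable OFFSET 0     LIMIT 75000" > /dev/null
>    psql -d mydb -c "SELECT * FROM mytable OFFSET 75000 LIMIT 75000" > /dev/null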
>
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Well, we already have the schema in a canned file, so we wouldn't need to do a pg_dump for the schema, and I can get into the database using psql, obviously; that's how I produced the setting from pg_settings. But I really don't want to go through 150K+ records to find the multiple broken ones if I don't have to. I'm just hoping for a faster way.
>
> On 7/17/2014 3:30 PM, Scott Whitney wrote:
>
> Well, at this point, let me let the big brains answer your main question on the thread and tell you what _I_ did when I got into a similar situation.
>
> a) pg_dump --schema-only <database name> > db.structure
> b) From here it was a lot of select * from table until I found the broken one(s) and I worked around the actual bad data by offset/limit on the selects on the
> busted tables.
>
> It took me several hours, but it did work (for me).
>
> That's assuming you can even psql to the db in the first place when the postmaster is running.
>
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> We're only working with a single database on this system, and yes, I attempted a pg_dump earlier today on just one single table - it failed (I know pg_dump and pg_basebackup use different mechanisms). Mind you, it's a large table with 154,000 rows in it. The overall database is somewhere around 43GB.
>
> On 7/17/2014 3:16 PM, Scott Whitney wrote:
>
> a) How many databases do you have?
> b) Have you tried to pg_dump the database(s) to see if that succeeds?
>
> If you _can_ pg_dump (which you might or might not be able to do), you could re-init the cluster and restore.
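>
> In outline it would be something like this (a sketch only -- "mydb" is a placeholder, and the data directory and port are taken from your ps output below):
>
>    pg_dump -p 5431 mydb > mydb.sql      # only works if the dump can read everything
>    # stop the postmaster, move the old data directory aside, then:
>    initdb -D /opt/datacenter
>    pg_ctl -D /opt/datacenter -o "-p 5431" start
>    createdb -p 5431 mydb
>    psql -p 5431 -d mydb -f mydb.sql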
>
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Hi all,
>
> You may have seen my post from yesterday about our production database getting corrupted. Well, this morning we brought the system down to single-user mode and ran an fsck, which did report some drive errors. We repeated it until no additional errors were reported. Then we brought the system back to multi-user and ran a successful pg_basebackup on the broken database. We then restarted the database, and a ps -ef result looks like:
>
> /usr/pgsql-9.2/bin/postmaster -D /opt/datacenter -o -c zero_damaged_pages=true -i -N 384 -p 5431
>
> After the DB started up, we ran a VACUUM FULL ANALYZE, which ran for about 3 hours, but the database is still showing the same type of errors in its log: invalid page header in block 29718... etc. What disturbs me a little is that I don't think zero_damaged_pages got applied. Checking the pg_settings view, we got:
>
> select name, setting, boot_val, reset_val from pg_settings where name = 'zero_damaged_pages';
>         name        | setting | boot_val | reset_val
> --------------------+---------+----------+-----------
>  zero_damaged_pages | on      | off      | on
>
> Now, my colleague ran this after he tried running some operations again, after I told him how to set zero_damaged_pages again. He swears that it was on when the first VACUUM FULL ANALYZE was run, but I'm not as sure. Plus, I don't understand why the boot_val shows as off. In any event, as we're still getting log errors like before, I don't really know what to try next other than rerunning the VACUUM FULL again. Help?
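>
> For reference, one way to be sure the setting is on in the very session that runs the vacuum (just a sketch -- "mydb" is a placeholder, and zero_damaged_pages needs a superuser connection):
>
>    psql -p 5431 -d mydb <<'SQL'
>    SET zero_damaged_pages = on;
>    SHOW zero_damaged_pages;    -- should report "on" for this session
>    VACUUM FULL ANALYZE;
>    SQL
>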
> --
> Jay
>
>
>
> --
> Sent via pgsql-admin mailing list (pgsql-admin(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-admin