Re: corruption diag/recovery, pg_dump crash

From: "Ed L(dot)" <pgsql(at)bluepolka(dot)net>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: corruption diag/recovery, pg_dump crash
Date: 2003-12-06 21:45:40
Message-ID: 200312061445.40643.pgsql@bluepolka.net
Lists: pgsql-general

It may be worth mentioning that the system has one 7.2.3 cluster, five
7.3.2 clusters, and twelve 7.3.4 clusters, all with data on the same
partition/device, and that all of the corruption has occurred on only five
of the twelve 7.3.4 clusters.

TIA.

On Saturday December 6 2003 2:30, Ed L. wrote:
> We are seeing what looks like pgsql data file corruption across multiple
> clusters on a RAID5 partition on a single redhat linux 2.4 server running
> 7.3.4. System has ~20 clusters installed with a mix of 7.2.3, 7.3.2, and
> 7.3.4 (mostly 7.3.4), 10gb ram, 76gb on a RAID5, dual cpus, and very busy
> with hundreds and sometimes > 1000 simultaneous connections. After ~250
> days of continuous, flawless uptime operations, we recently began seeing
> major performance degradation accompanied by messages like the following:
>
> ERROR: Invalid page header in block NN of some_relation (10-15
> instances)
>
> ERROR: XLogFlush: request 38/5E659BA0 is not satisfied ... (1 instance
> repeated many times)
>
> I think I've been able to repair most of the "Invalid page header" errors
> by rebuilding indices or truncating/reloading table data. The XLogFlush
> error was occurring for a particular index, and a drop/reload has at least
> stopped that error. Now, a pg_dump error is occurring on one cluster,
> preventing a successful dump. Of course, it went unnoticed long enough
> to roll over our good online backups, and the bazillion-dollar
> offline/offsite backup system wasn't working properly. Here's the
> pg_dump output, edited to protect the guilty:
>
> pg_dump: PANIC: open of .../data/pg_clog/04E5 failed: No such file or
> directory
> pg_dump: lost synchronization with server, resetting connection
> pg_dump: WARNING: Message from PostgreSQL backend:
> The Postmaster has informed me that some other backend
> died abnormally and possibly corrupted ... blah blah
> pg_dump: SQL command to dump the contents of table "sometable" failed:
> PQendcopy() failed.
> pg_dump: Error message from server: server closed the connection
> unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> pg_dump: The command was: COPY public.sometable ("key", ...) TO stdout;
> pg_dumpall: pg_dump failed on somedb, exiting
>
> Why that 04E5 file is missing, I haven't a clue. I've attached an "ls
> -l" for the pg_clog dir.
>
> Past list discussions suggest this may be an elusive hardware issue. We
> did find a msg in /var/log/messages...
>
> kernel: ISR called reentrantly!!
>
> which some here have found newsgroup reports linking to some sort of
> RAID/BIOS issue. We've taken the machine offline and conducted
> extensive hardware diagnostics on RAID controller, filesystem (fsck),
> RAM, and found no further indication of hardware failure. The machine
> had run flawlessly for these ~20 clusters for ~250 days until cratering
> yesterday amidst these errors and absurd system (disk) IO sluggishness.
> After the reboot and upgrades, the machine continues to exhibit
> infrequent (or at least infrequently discovered) corruption. On the
> advice of the hardware vendor's (Dell) support folks, we've upgraded our
> kernel (now 2.4.20-24.7bigmem), several drivers, and the RAID controller
> firmware, rebooted, etc. The disk IO sluggishness has largely
> diminished, but we're still seeing the Invalid page header error pop up
> anew, albeit infrequently. The XLogFlush error seems to have gone away
> with the reconstruction of an index.
>
> Current plan is to get as much data recovered as possible, and then do
> significant hardware replacements (along with more frequent planned
> reboots and more vigilant backups).
>
> Any clues/suggestions for recovering this data or fixing other issues
> would be greatly appreciated.
>
> TIA.
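
On the missing pg_clog/04E5 file in the quoted report: a rough way to see which transaction IDs that segment would hold is sketched below. It assumes the stock clog layout (2 status bits per transaction, 0x100000 transactions per segment file, segment files named by hex segment number). The function names and the sample directory listing are hypothetical, since the "ls -l" attachment is not reproduced here. If the resulting range is far beyond the cluster's real transaction counter, one possible reading is that a damaged tuple header asked for a bogus transaction ID, rather than that a valid file was deleted. That is only an inference, not something the report confirms.

```python
# Sketch only: maps a pg_clog segment file name to the transaction ID
# range it would cover, under an ASSUMED layout of 0x100000 transactions
# per segment (2 status bits per transaction, default build settings).
XACTS_PER_SEGMENT = 0x100000  # assumption, not taken from the report


def clog_segment_xid_range(segment_name):
    """Return (first_xid, last_xid) covered by a segment such as '04E5'."""
    seg = int(segment_name, 16)
    first = seg * XACTS_PER_SEGMENT
    return first, first + XACTS_PER_SEGMENT - 1


def segment_for_xid(xid):
    """Return the 4-digit hex segment name that would hold this xid."""
    return "%04X" % (xid // XACTS_PER_SEGMENT)


def is_within_existing(segment_name, existing_names):
    """True if the segment number lies between the lowest and highest
    segment numbers actually present in the directory listing."""
    nums = sorted(int(n, 16) for n in existing_names)
    seg = int(segment_name, 16)
    return bool(nums) and nums[0] <= seg <= nums[-1]


if __name__ == "__main__":
    first, last = clog_segment_xid_range("04E5")
    print("segment 04E5 would cover xids %d..%d" % (first, last))

    # Hypothetical listing; substitute the real contents of pg_clog.
    listing = ["0000", "0001", "0002"]
    print("inside existing range:", is_within_existing("04E5", listing))
```

Comparing the computed range with the segment files actually present in pg_clog shows whether 04E5 is a hole in an otherwise continuous series or lies well outside it.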
