Re: database corruption

From: Chris Travers <chris(at)travelamericas(dot)com>
To: Ian Westmacott <ianw(at)intellivid(dot)com>, pgsql-admin(at)postgresql(dot)org
Subject: Re: database corruption
Date: 2005-04-16 01:29:13
Message-ID: 42606A69.9010102@travelamericas.com
Lists: pgsql-admin

Hi Ian;

I think it is important to figure out why this is happening. I would
not want to run any production databases on systems that were failing
like this.

I am trying to figure out the likely causes of the errors...

1) Do any other computers in your building suffer random application
crashes, power-downs, etc.?
2) I take it there are no RAID controllers involved?
3) Is the RAM non-ECC?
4) Are the systems on UPSes?

If I could make a wild (and probably wrong) guess, I would wonder if
something external to the system (like electrical supply) was
introducing glitches into memory, causing bad data to be written. I am
only mentioning it because I have implicated the electrical supply in other
cases where rare computer failures were affecting many systems...

Ian Westmacott wrote:

>For several weeks now we have been experiencing fairly
>severe database corruption upon clean reboot. It is very
>repeatable, and the corruption is of the following forms:
>
>ERROR: could not access status of transaction foo
>DETAIL: could not open file "bar": No such file or directory
>
>ERROR: invalid page header in block foo of relation "bar"
>
>ERROR: uninitialized page in block foo of relation "bar"
>
>
>At first, we believed this was related to XFS, and have
>been pursuing investigations along those lines. However,
>we have now experienced the exact same problem with JFS.
>
>Here are some details:
>
>- Postgres 7.4.2
>- 2.6.6 kernel.org kernel
>- dedicated database partition
>- repeatable with XFS and JFS (have not seen on ext3)
>- repeatable with and without Linux software RAID 0
>- repeatable with IDE and SATA
>- repeatable with and without fsync, and with fdatasync
>- repeatable on multiple systems
>
>
>I have two questions:
>
>- any known reason why this might be occurring? (we must
> have something wrong, for this high rate of severe
> error).
>
>- if I don't care about losing data, and am not interested
> in trying to recover anything, how can I arrange for
> Postgres to proceed normally? I know about
> zero_damaged_pages, but this doesn't help with missing
> transaction files and such. Is there any way to get
> Postgres to chuck anything bad and proceed?
>
>Thanks,
>
> --Ian