Re: clarifying a few error messages

From:	Thomas F(dot)O'Connell <tfo(at)monsterlabs(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-general(at)postgresql(dot)org
Subject:	Re: clarifying a few error messages
Date:	2003-01-13 20:51:16
Message-ID:	C2C36EE4-2738-11D7-AD8E-00306596B4E8@monsterlabs.com
Lists:	pgsql-general

Well, here's my concern: the first postgres outage was caused by the
server rebooting itself. The rest of the server reboots, a few of which
took postgres with them, were caused pre-emptively by our sysadmin.

In looking at snapshots of the activity on the machine surrounding
recent outages (of either postgres or the whole box), it seems that
postgres is one of the culprits.

Right before it went out, memory was almost exhausted. I've seen the
signal 9 error before, which results from a server under duress, right?
Could it not be a vicious cycle? I.e., postgres begins consuming
tremendous resources on a machine, the kernel gets frightened and starts
killing procs, including postgres, and then reboots? The reboots don't
occur during periods of light load, only when there are high numbers of
both httpd and postgres connections running.

I'm a little suspicious of blaming the hardware. I think it's more
likely an extremely stressful server environment. I'm just trying to
figure out where to turn next for the diagnostics. Most recently, the
memory usage issue came to light.

As for the bad data on disk, I've got a backup, but how severe are we
talking? By not trusting it, do you mean that it could be flagrantly
wrong (i.e., truly corrupted; bad data), or just out of sync with
whatever writes were last occurring?

-tfo

On Monday, January 13, 2003, at 02:03 , Tom Lane wrote:

> My guess is that you've got hardware problems, most likely bad RAM. The
> SIGSEGV is probably a side-effect of RAM dropping bits unexpectedly ---
> for example, the value of a pointer stored in memory might have changed
> so that it appears to point outside Postgres' valid address space,
> leading to SIGSEGV next time the pointer is used.
>
> The fact that you're seeing unexpected reboots is what points the finger
> at the hardware; evidently the kernel is suffering the same kinds of
> problems. (Or you could believe that your hardware is okay and both the
> kernel and Postgres have suddenly developed severe bugs; but the
> hardware theory seems much more plausible.)
>
>> And is exit code 2 just related to the bad clog?
>
> Yes. This part looks like corrupted data on disk :-( ... likely also a
> side effect of busted RAM. Probably the RAM corrupted a page image that
> was sitting in an in-memory buffer, and then it got written out before
> any other problem was noticed.
>
> I hope you have a recent good backup that you can restore from after you
> fix your hardware. I would not trust what's presently on your disk if I
> were you.
>
> regards, tom lane

In response to

Re: clarifying a few error messages at 2003-01-13 20:03:51 from Tom Lane

Responses

Re: clarifying a few error messages at 2003-01-13 22:30:12 from Tom Lane
