From: ncm(at)zembu(dot)com (Nathan Myers)
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Quite strange crash
Date: 2001-01-09 00:10:30
Message-ID: 20010108161030.B571@store.zembu.com
Lists: pgsql-hackers

On Mon, Jan 08, 2001 at 12:21:38PM -0500, Tom Lane wrote:
> Denis Perchine <dyp(at)perchine(dot)com> writes:
> >>>>>>> FATAL: s_lock(401f7435) at bufmgr.c:2350, stuck spinlock. Aborting.
> >>>>>
> >>>>> Were there any errors before that?
>
> > Actually you can have a look on the logs yourself.
>
> Well, I found a smoking gun: ...
> What seems to have happened is that 2501 curled up and died, leaving
> one or more buffer spinlocks locked. ...
> There is something pretty fishy about this. You aren't by any chance
> running the postmaster under a ulimit setting that might cut off
> individual backends after a certain amount of CPU time, are you?
> What signal does a ulimit violation deliver on your machine, anyway?

It's worth noting here that modern Unixes run around killing user-level
processes more or less at random when free swap space (and sometimes
just RAM) runs low. AIX was the first such, but would send SIGDANGER
to processes first to try to reclaim some RAM; critical daemons were
expected to explicitly ignore SIGDANGER. Other Unixes picked up the
idea without picking up the SIGDANGER behavior.

The reason for this common pathological behavior is usually traced
to sloppy resource accounting. It manifests as the bad policy of
having malloc() (and sbrk() or mmap() underneath) return a valid
pointer rather than NULL, on the assumption that most of the memory
asked for won't be used just yet. In any case, the system cannot
know at allocation time how much memory will really be available
when the process finally touches it.

Usually the problem is explained with the example of a very large
process that forks, suddenly demanding twice as much memory. (Apache
is particularly egregious this way, allocating lots of memory and
then forking several times.) Instead of failing the fork, the kernel
waits until a process touches memory it was granted, then checks
whether any RAM/swap has turned up to satisfy it, and kills the
process (or some random other process!) if not.

Now that programs have come to depend on this behavior, it has become
very hard to fix it. The implication for the rest of us is that we
should expect our processes to be killed at random, just for touching
memory granted, or for no reason at all. (Kernel people say, "They're
just user-level programs, restart them;" or, "Maybe we can designate
some critical processes that don't get killed".) In Linux they try
to invent heuristics to avoid killing the X server, because so many
programs depend on it. It's a disgraceful mess, really.

The relevance to the issue at hand is that processes dying during
heavy memory load is a documented feature of our supported platforms.

Nathan Myers
ncm(at)zembu(dot)com
