Re: Core dump

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dan Moschuk <dan(at)freebsd(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Core dump
Date: 2000-10-12 22:14:33
Message-ID: 28973.971388873@sss.pgh.pa.us
Lists: pgsql-hackers

Dan Moschuk <dan(at)freebsd(dot)org> writes:
> It would appear from that very rough test program that solaris doesn't mind
> system calls from within a signal handler.

Still, it's a mighty peculiar backtrace.

After looking at postmaster.c, I see that the postmaster will issue
SIGUSR1 to all remaining backends *each* time it sees a child exit
with nonzero status. And it just so happens that quickdie() chooses
to exit with exit(1), not exit(0). So a new theory is:

1. Some backend crashes.

2. Postmaster issues SIGUSR1 to all remaining backends.

3. As each backend gives up the ghost, postmaster gets another wait()
response and issues another SIGUSR1 to the ones that are left.

4. Last remaining backend has been SIGUSR1'd enough times to overrun
stack memory, leading to coredump.
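To get a feel for how the signals pile up under this theory, here is a toy model in Python. It is only a sketch, not anything taken from postmaster.c: the function name is made up, no real signals are sent, and it assumes the worst case where the backends die strictly one at a time so that the postmaster sees a separate wait() response for each.

```python
def simulate_signal_cascade(n_backends):
    """Count the SIGUSR1s each backend would receive in the toy model.

    Assumptions (illustrative only): backend 0 is the one that crashes,
    every other backend exits with nonzero status when signaled, and the
    postmaster reaps exactly one child between rounds of signaling.
    """
    received = [0] * n_backends
    alive = list(range(n_backends))

    # Step 1: some backend crashes.
    alive.pop(0)

    while alive:
        # Steps 2 and 3: the postmaster saw a nonzero exit status,
        # so it signals every backend that is still running.
        for backend in alive:
            received[backend] += 1
        # The next backend gives up the ghost, also with nonzero status,
        # which triggers another round on the next pass.
        alive.pop(0)

    return received


if __name__ == "__main__":
    counts = simulate_signal_cascade(8)
    print("signals per backend:", counts)
    print("last backend received:", counts[-1])
```

In this model the last of N backends is signaled N-1 times, once for every earlier exit. Whether that is enough nested handler invocations to overrun the stack depends on how much stack each one uses, which the model does not try to estimate.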

I'm not too enamored of this theory because it doesn't explain the
perfect repeatability shown in your backtrace. It seems unlikely that
each recursive quickdie() call would get just as far as elog's write()
and no farther before the postmaster is able to issue another signal.
Still, it's a possibility.

We should probably tweak the postmaster to be less enthusiastic about
signaling its children repeatedly.
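One possible shape for such a tweak, sketched as a toy model in Python rather than as a patch: remember that the children have already been told to quit, and skip the signaling on later nonzero exits. The class, attribute, and method names here are hypothetical and do not correspond to anything in postmaster.c.

```python
class ToyPostmaster:
    """Toy model of a postmaster that signals its children at most once."""

    def __init__(self, backends):
        self.alive = set(backends)
        # Hypothetical flag: set once the remaining backends have been signaled.
        self.shutdown_signaled = False
        self.signals_sent = {backend: 0 for backend in backends}

    def on_child_exit(self, backend, status):
        """Handle one wait() response for a child that exited with `status`."""
        self.alive.discard(backend)
        if status != 0 and not self.shutdown_signaled:
            self.shutdown_signaled = True
            for other in self.alive:
                # Stand-in for sending SIGUSR1 to the backend.
                self.signals_sent[other] += 1


if __name__ == "__main__":
    pm = ToyPostmaster([101, 102, 103])
    pm.on_child_exit(101, 1)  # first backend crashes
    pm.on_child_exit(102, 1)  # second exits with nonzero status after being signaled
    print(pm.signals_sent)
```

With the flag in place, each surviving backend is signaled once no matter how many of its siblings later exit with nonzero status. A real fix would also have to decide when the flag gets cleared again, which this sketch leaves out.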

Meanwhile, have you tried looking in the postmaster log? The postmaster
should have logged at least the exit status for the first backend to
fail.

regards, tom lane
