Re: Core dump

From: Dan Moschuk <dan(at)freebsd(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Dan Moschuk <dan(at)freebsd(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Core dump
Date: 2000-10-12 20:47:53
Message-ID: 20001012164752.A3004@spirit.jaded.net
Lists: pgsql-hackers


| > Sparc solaris 2.7 with postgres 7.0.2
| > It seems to be reproducible; the server crashes on us about once every
| > few hours.
|
| That's a very bizarre backtrace. Why the multiple levels of recursive
| entry to the quickdie() signal handler? I wonder if you aren't looking
| at some kind of Solaris bug --- perhaps it's not able to cope with a
| signal handler turning around and issuing new kernel calls.

I'm not sure that is the issue; see below.

| The core file you are looking at is probably *not* from the original
| failure, whatever that is. The sequence is probably
|
| 1. Some backend crashes for unknown reason, dumping core.
|
| 2. Postmaster observes messy death of a child, decides that mass suicide
| followed by restart is called for. Postmaster sends SIGUSR1 to all
| remaining backends to make them commit hara-kiri.
|
| 3. One or more other backends crash trying to obey postmaster's command.
| The corefile left for you to examine comes from whichever crashed
| last.
|
| So there are at least two problems here, but we only have evidence of
| the second one.
|
| Since the problem is fairly reproducible, I'd suggest you temporarily
| dike out the elog(NOTICE) call in quickdie() (in
| src/backend/tcop/postgres.c), which will probably allow the backends
| to honor SIGUSR1 without dumping core. Then you have a shot at seeing
| the core from the original failure.

I will try this; however, the database is currently running under light load.
Only under high load does postgres start to choke and eventually die.

| Assuming that this works (i.e., you find a core that's not got anything
| to do with quickdie()), I'd suggest an inquiry to Sun about whether
| their signal handler logic hasn't got a problem with write() issued
| from inside a signal handler. Meanwhile let us know what the new
| backtrace shows.

I wrote a quick test program to test this theory. Below is the code and the
output.

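#include <sys/types.h>
#include <stdio.h>
#include <unistd.h>
#include <signal.h>

static void moo(int);

int
main(void)
{
    /* Install the handler, then deliver SIGUSR1 to ourselves. */
    signal(SIGUSR1, moo);
    raise(SIGUSR1);
    return 0;
}

/* Handler: issue a write() system call while inside the signal handler. */
static void
moo(int cow)
{
    (void) cow;                 /* signal number is unused */
    printf("Getting ready for write()\n");
    write(STDOUT_FILENO, "Hello!\n", 7);
    printf("Done.\n");
}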

eclipse% ./x
Getting ready for write()
Hello!
Done.
eclipse%

It would appear from that very rough test program that Solaris doesn't mind
system calls from within a signal handler.

--
Man is a rational animal who always loses his temper when he is called
upon to act in accordance with the dictates of reason.
-- Oscar Wilde
