Re: [ADMIN] Streaming Replication Server Crash

From: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: raghu ram <raghuchennuru(at)gmail(dot)com>, pgsql-admin(at)postgresql(dot)org, pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: [ADMIN] Streaming Replication Server Crash
Date: 2012-10-23 05:03:33
Message-ID: 50862525.5060904@ringerc.id.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin pgsql-general

On 10/22/2012 08:52 PM, Tom Lane wrote:
> Craig Ringer <ringerc(at)ringerc(dot)id(dot)au> writes:
>> On 10/19/2012 04:40 PM, raghu ram wrote:
>>> 2012-10-19 12:26:46 IST [1338]: [18-1] user=,db= LOG: server process
>>> (PID 15565) was terminated by signal 10
>
>> That's odd. SIGUSR1 (signal 10) shouldn't terminate PostgreSQL.
>
>> Was the server intentionally sent SIGUSR1 by an admin? Do you know what
>> triggered the signal?
>
> SIGUSR1 is used for all sorts of internal cross-process signaling
> purposes. There's no need to hypothesize any external force sending
> it; if somebody had broken a PG process's signal handling setup for
> SIGUSR1, a crash of this sort could be expected in short order.
>
> But having said that, are we sure 10 is SIGUSR1 on the OP's platform?
> AFAIK, that signal number is not at all compatible across different
> flavors of Unix. (I see SIGUSR1 is 30 on OS X for instance.)

Gah. I incorrectly though that POSIX specified signal *numbers*, not
just names. That does not appear to actually be the case. Thanks.

A bit of searching suggests that on Solaris/SunOS, signal 10 is SIGBUS:

http://www.s-gms.ms.edus.si/cgi-bin/man-cgi?signal+3HEAD
http://docs.oracle.com/cd/E23824_01/html/821-1464/signal-3head.html

... which tends to suggest an entirely different interpretation than
"someone broke a signal hander":

https://blogs.oracle.com/peteh/entry/sigbus_versus_sigsegv_according_to

such as:

- Bad mmap()ed read
- alignment error
- hardware fault

so it's not immensely different to a segfault in that it can be caused
by errors in hardware, OS, or applications.

Raghu, did PostgreSQL dump a core file? If it didn't, you might want to
enable core dumps in future. If it did dump a core, attaching a debugger
to the core file might tell you where it crashed, possibly offering some
more information to diagnose the issue. I'm not familiar enough with
Solaris to offer detailed advice on that, especially as you haven't
mentioned your Solaris version, how you installed Pg, etc. This may be
of some use:

http://stackoverflow.com/questions/6403803/how-to-get-backtrace-function-line-number-on-solaris

--
Craig Ringer

In response to

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message Craig Ringer 2012-10-23 05:08:10 Re: Streaming Replication Server Crash
Previous Message Tom Lane 2012-10-22 12:52:38 Re: [ADMIN] Streaming Replication Server Crash

Browse pgsql-general by date

  From Date Subject
Next Message Craig Ringer 2012-10-23 05:08:10 Re: Streaming Replication Server Crash
Previous Message Scott Marlowe 2012-10-22 22:51:36 Re: Plug-pull testing worked, diskchecker.pl failed