Re: Defunct postmasters

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gavin Scott <gavin(at)pokerpages(dot)com>, Philip Crotwell <crotwell(at)seis(dot)sc(dot)edu>
Cc: pgsql-general(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Defunct postmasters
Date: 2002-02-25 22:59:02
Message-ID: 25062.1014677942@sss.pgh.pa.us
Lists: pgsql-general pgsql-hackers

Gavin Scott <gavin(at)pokerpages(dot)com> writes:
> We have lately begun having problems with our production database
> running postgres 7.1 on linux kernel v 2.4.17. The system had run
> without incident for many months (there were occasional reboots). Since
> we upgraded to kernel 2.4.17 on Dec. 31 it ran non-stop without problem
> until Feb 13, when postmaster appeared to stop taking new incoming
> connections. We restarted and then the problem struck again Saturday
> night (Feb 23).

If it happens again, could you attach to the postmaster with gdb and get
a stack trace from it?

> This one sounded like an exact match:
> http://groups.google.com/groups?hl=en&frame=right&th=a52001dbca656ddc&seekm=Pine.GSO.4.10.10105111011390.27338-100000%40tigger.seis.sc.edu#s

After looking again at the thread with Philip Crotwell, I have developed
a theory that might explain the postmaster's failing to reap zombie
(defunct) children right away. The basic loop in the postmaster is to
use select(2) to wait for a connection attempt, and when one occurs,
use accept(2) to establish the connection; then fork off a child process
to deal with the connection, and return to the select(). Zombie
children are supposed to be reaped by the SIGCHLD signal handler, which
we enable only while waiting for select().
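
As an illustration only, the loop described above could be sketched roughly
as follows. This is not the actual postmaster source: the names
open_listener(), serve_connections() and reap_children() are hypothetical,
the child does no real work, and the final "drain" step exists only so the
sketch can report how many children were reaped.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <signal.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t reaped = 0;

/* SIGCHLD handler: collect every child that has already exited. */
static void reap_children(int signo)
{
    int saved_errno = errno;
    (void) signo;
    while (waitpid(-1, NULL, WNOHANG) > 0)
        reaped++;
    errno = saved_errno;
}

/* Hypothetical helper: listen on an ephemeral loopback TCP port. */
int open_listener(unsigned short *port)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if (fd < 0)
        return -1;
    if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0 ||
        listen(fd, 8) < 0 ||
        getsockname(fd, (struct sockaddr *) &addr, &len) < 0)
    {
        close(fd);
        return -1;
    }
    *port = ntohs(addr.sin_port);
    return fd;
}

/* Fork one child per connection; return the number of children reaped. */
int serve_connections(int listen_fd, int max_conns)
{
    struct sigaction sa;
    sigset_t chld, orig_mask, wait_mask;
    int forked = 0;

    assert(listen_fd >= 0);
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = reap_children;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGCHLD, &sa, NULL);
    /* Keep SIGCHLD blocked except while waiting in select(). */
    sigemptyset(&chld);
    sigaddset(&chld, SIGCHLD);
    sigprocmask(SIG_BLOCK, &chld, &orig_mask);
    wait_mask = orig_mask;
    sigdelset(&wait_mask, SIGCHLD);
    reaped = 0;
    while (forked < max_conns)
    {
        fd_set readable;
        int nready, select_errno, conn;
        pid_t pid;

        FD_ZERO(&readable);
        FD_SET(listen_fd, &readable);
        sigprocmask(SIG_SETMASK, &wait_mask, NULL);  /* handler may run */
        nready = select(listen_fd + 1, &readable, NULL, NULL, NULL);
        select_errno = errno;
        sigprocmask(SIG_BLOCK, &chld, NULL);         /* handler held off */
        if (nready < 0 && select_errno != EINTR)
            break;
        if (nready <= 0)
            continue;           /* interrupted by SIGCHLD: wait again */
        conn = accept(listen_fd, NULL, NULL);   /* blocks if queue is empty */
        if (conn < 0)
            continue;
        pid = fork();
        if (pid == 0)
            _exit(0);           /* a real child would run the session */
        close(conn);
        if (pid > 0)
            forked++;
    }
    /* Sketch-only drain: let the handler reap whatever is outstanding. */
    while (reaped < forked)
        sigsuspend(&wait_mask);
    sigprocmask(SIG_SETMASK, &orig_mask, NULL);
    return (int) reaped;
}
```

A caller would create the socket with open_listener() and hand it to
serve_connections(). Note that accept() runs with SIGCHLD blocked, so
nothing is reaped while it waits.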

The scenario that comes to mind is: suppose that an abortive connection
attempt triggers select() to return a connection-ready indication, but
by the time we reach the accept() call, the kernel has decided the
connection was bogus. (This seems somewhat plausible in the case of
a portscan, much less so for real connection attempts.) The accept()
would then block waiting for another connection attempt to come in.
Until one happened, no SIGCHLD interrupts could be serviced, so you
might see zombie children hanging around after a while.
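
The zombie half of this scenario can be reproduced in isolation. The sketch
below is a standalone demonstration under assumed names
(zombie_waits_for_unblock() and reap_children() are hypothetical), not
postmaster code: with SIGCHLD blocked, an exited child stays unreaped until
the signal is unblocked again.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t reaped = 0;

/* SIGCHLD handler: collect every child that has already exited. */
static void reap_children(int signo)
{
    int saved_errno = errno;
    (void) signo;
    while (waitpid(-1, NULL, WNOHANG) > 0)
        reaped++;
    errno = saved_errno;
}

/*
 * Returns 1 if the exited child stayed unreaped while SIGCHLD was blocked
 * and was reaped once the signal was let through, 0 if not, -1 on error.
 */
int zombie_waits_for_unblock(void)
{
    struct sigaction sa;
    sigset_t chld, orig_mask, open_mask;
    siginfo_t info;
    pid_t pid;
    int reaped_while_blocked;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = reap_children;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGCHLD, &sa, NULL) < 0)
        return -1;

    /* Block SIGCHLD, as the loop does everywhere outside select(). */
    sigemptyset(&chld);
    sigaddset(&chld, SIGCHLD);
    sigprocmask(SIG_BLOCK, &chld, &orig_mask);
    open_mask = orig_mask;
    sigdelset(&open_mask, SIGCHLD);
    reaped = 0;

    pid = fork();
    if (pid < 0)
    {
        sigprocmask(SIG_SETMASK, &orig_mask, NULL);
        return -1;
    }
    if (pid == 0)
        _exit(0);               /* child exits at once */

    /* Wait for the exit but leave the child as a zombie (WNOWAIT). */
    memset(&info, 0, sizeof(info));
    while (waitid(P_PID, (id_t) pid, &info, WEXITED | WNOWAIT) < 0)
    {
        if (errno != EINTR)
            return -1;
    }
    assert(info.si_pid == pid);
    reaped_while_blocked = reaped;      /* handler has not been able to run */

    /* Let SIGCHLD through; the handler now reaps the zombie. */
    while (!reaped)
        sigsuspend(&open_mask);
    sigprocmask(SIG_SETMASK, &orig_mask, NULL);
    return reaped_while_blocked == 0 && reaped == 1;
}
```

In the scenario above, the role of the blocked stretch is played by
accept() waiting for a connection that never arrives.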

The flaw in this idea is that once a second connection attempt does come
in, everything should be perfectly back to normal: the postmaster will
accept it and then resume normal operations. So it's not at all clear
how this could cause your complaint of being unable to accept new
connections.

Nonetheless, Philip did exhibit a stack trace showing the postmaster
waiting at accept(). If someone else can replicate that, I'd start to
think that we had enough material to justify filing a Linux kernel bug
report. Perhaps it's the kernel, not the postmaster, that's wedged
somehow.

I am thinking that it'd be a good idea for the postmaster to run the
listening socket in nonblocking mode, which should theoretically prevent
the accept() call from blocking when there's no new connection
available. It's not clear whether that would be a workaround for a
kernel bug, if there is one --- but it might be worth trying.
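
A minimal sketch of that idea, assuming a plain TCP listening socket, is
shown below. It is not a patch against the postmaster: open_listener(),
set_nonblocking(), accept_pending() and accept_next() are hypothetical
names, and treating ECONNABORTED the same as "nothing pending" is a choice
made for this sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

enum { ACCEPT_NONE = -1, ACCEPT_ERROR = -2 };

/* Hypothetical helper: listen on an ephemeral loopback TCP port. */
int open_listener(unsigned short *port)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if (fd < 0)
        return -1;
    if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0 ||
        listen(fd, 8) < 0 ||
        getsockname(fd, (struct sockaddr *) &addr, &len) < 0)
    {
        close(fd);
        return -1;
    }
    *port = ntohs(addr.sin_port);
    return fd;
}

/* Put a descriptor into nonblocking mode; returns 0 on success. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Accept one connection if one is queued; never waits for one. */
int accept_pending(int listen_fd)
{
    int conn;

    assert(listen_fd >= 0);
    do
        conn = accept(listen_fd, NULL, NULL);
    while (conn < 0 && errno == EINTR);
    if (conn >= 0)
        return conn;
    if (errno == EAGAIN || errno == EWOULDBLOCK || errno == ECONNABORTED)
        return ACCEPT_NONE;     /* the connection went away, or never was */
    return ACCEPT_ERROR;
}

/* Wait in select(), then accept; go back to select() on a false alarm. */
int accept_next(int listen_fd)
{
    for (;;)
    {
        fd_set readable;
        int conn;

        FD_ZERO(&readable);
        FD_SET(listen_fd, &readable);
        if (select(listen_fd + 1, &readable, NULL, NULL, NULL) < 0)
        {
            if (errno == EINTR)
                continue;
            return ACCEPT_ERROR;
        }
        conn = accept_pending(listen_fd);
        if (conn != ACCEPT_NONE)
            return conn;        /* a connection, or a hard error */
    }
}
```

With the socket in this mode, a connection that vanishes between select()
and accept() sends the loop back to select() rather than leaving it parked
in accept().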

> Also, does anyone know any reason to NOT upgrade to 7.2?

The only significant glitch I've heard of is that pg_dump and psql have
a little disagreement over the handling of mixed-case database names and
user names. If you have any, you might have to hand-edit your pg_dump
script (put double quotes around such names in \connect lines) before
you can reload the database.

regards, tom lane
