Defunct postmasters

From: Gavin Scott <gavin(at)pokerpages(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Defunct postmasters
Date: 2002-02-25 20:51:42
Message-ID: 1014670303.13536.95.camel@gavin.pokerpages.com
Lists: pgsql-general pgsql-hackers

Hi,

We have recently begun having problems with our production database,
which runs PostgreSQL 7.1 on Linux kernel 2.4.17. The system had run
without incident for many months (with occasional reboots). After we
upgraded to kernel 2.4.17 on Dec. 31 it ran non-stop without problems
until Feb 13, when the postmaster appeared to stop accepting new incoming
connections. We restarted it, and the problem struck again Saturday
night (Feb 23).

In both instances, attempting to access the db via the psql command line
would just hang -- no error messages were printed. We also have two
perl scripts that connect to the database once every few
minutes; one runs on a remote server, the other locally. Both create log
files, and both appeared to be stuck trying to make a connection.

In the 2nd incident /var/log/postgresql.log contained:

Sat Feb 23 23:41:00 CST 2002
PacketReceiveFragment: read() failed: Connection reset by peer
pq_recvbuf: recv() failed: Connection reset by peer
pq_recvbuf: recv() failed: Connection reset by peer
Sat Feb 23 23:51:00 CST 2002
pq_recvbuf: recv() failed: Connection reset by peer
pq_recvbuf: recv() failed: Connection reset by peer

23:40 appears to have been when the problem began. I added a cron job to
put the date lines in the above; in the 1st incident I didn't have that
so it was difficult to tell what was happening when the problem began;
it did contain messages similar to the above but I can't guarantee they
were produced at the time of the problem.

dmesg, both on the postgres machine and on our remote server (which accesses
it via the script mentioned above), showed a couple of lines like:

sending pkt_too_big to self
sending pkt_too_big to self

Since there aren't any timestamps in dmesg I can't guarantee that those
were produced at the time of incident. Also I did not check dmesg
during the 1st incident.

In both incidents there were multiple zombies hanging around:

postgres 21264 0.0 0.0 0 0 ? Z Feb23 0:00 [postmaster <defunct>]
postgres 21266 0.0 0.0 0 0 ? Z Feb23 0:00 [postmaster <defunct>]
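
A minimal sketch of how such defunct entries could be picked out of `ps aux` output automatically, for example from a monitoring script. It assumes the standard `ps aux` column order (USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND); the function name `find_zombies` is hypothetical and not part of any PostgreSQL tooling.

```python
import subprocess
from typing import List, Tuple


def find_zombies(ps_output: str) -> List[Tuple[int, str]]:
    """Return (pid, command) for each process whose STAT column starts with 'Z'."""
    zombies = []
    for line in ps_output.splitlines():
        # Split into at most 11 fields so the command keeps its internal spaces.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header line or malformed row
        if fields[7].startswith("Z"):
            zombies.append((int(fields[1]), fields[10]))
    return zombies


if __name__ == "__main__":
    # Assumes a ps that accepts BSD-style "aux" options, as on Linux.
    result = subprocess.run(["ps", "aux"], capture_output=True, text=True)
    for pid, command in find_zombies(result.stdout):
        print(pid, command)
```

A zombie holds no memory or CPU (hence the zero columns above); it persists only until its parent collects its exit status.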

The system was mostly idle at the time I began investigating both
incidents.

While searching the mailing list archives I did find 2 threads that
seemed to reference similar problems.

This one sounded like an exact match:
http://groups.google.com/groups?hl=en&frame=right&th=a52001dbca656ddc&seekm=Pine.GSO.4.10.10105111011390.27338-100000%40tigger.seis.sc.edu#s
There were similar elements mentioned here:
http://archives.postgresql.org/pgsql-hackers/2002-01/msg01142.php

I was especially intrigued by this quote from Tom Lane in the 2nd link:

"It sounds like the postmaster got into a state where it was not
responding to SIGCHLD signals. We fixed one possible cause of that
between 7.1 and 7.2, but without a more concrete report I have no way to
know if you saw the same problem or a different one. I'd have expected
connection attempts to unwedge the postmaster in any case."
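
To illustrate the mechanism the quote refers to: a parent process normally clears zombies by calling waitpid() when it receives SIGCHLD. The sketch below shows that general Unix pattern only; it is not PostgreSQL's actual postmaster code, and the names `reap_children`, `reaped`, and `demo` are hypothetical. If a parent stops running such a handler, exited children stay in the `<defunct>` state.

```python
import os
import signal
import time

reaped = []  # PIDs of children whose exit status has been collected


def reap_children(signum=None, frame=None):
    """Collect every child that has already exited, without blocking."""
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # this process has no children at all
        if pid == 0:
            return  # children exist, but none has exited yet
        reaped.append(pid)


def demo() -> bool:
    """Fork one short-lived child and report whether the handler reaped it."""
    signal.signal(signal.SIGCHLD, reap_children)
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately, becoming a zombie until reaped
    deadline = time.monotonic() + 2.0
    while pid not in reaped and time.monotonic() < deadline:
        time.sleep(0.05)
    return pid in reaped


if __name__ == "__main__":
    print("child reaped:", demo())
```

The loop inside the handler matters because several children can exit while only one SIGCHLD is delivered; a handler that calls waitpid() once per signal can leave zombies behind.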

Does anyone have any idea what might be causing our problem and whether
or not upgrading to 7.2 might solve it?

Also, does anyone know any reason to NOT upgrade to 7.2? I've only
recently joined this list, so I may have overlooked outstanding known
problems with 7.2.

Thanks,
Gavin Scott
gavin(at)pokerpages(dot)com
