Re: Hung postmaster (8.3.9)

From: "Ed L(dot)" <pgsql(at)bluepolka(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Hung postmaster (8.3.9)
Date: 2010-03-01 23:40:55
Message-ID: 201003011640.55944.pgsql@bluepolka.net
Lists: pgsql-general pgsql-hackers

On Monday 01 March 2010 @ 16:03, Ed L. wrote:
> On Monday 01 March 2010 @ 15:59, Ed L. wrote:
> > > This just happened again ~24 hours after full reload from
> > > backup. Arrrgh.
> > >
> > > Backtrace looks the same again, same file, same
> > > __read_nocancel(). $PGDATA/global/pg_auth looks fine to
> > > me, permissions are 600, entries are 3 or more
> > > double-quoted items per line each separated by a space,
> > > items 3 and beyond being groups.
> > >
> > > Any clues?
>
> Also seeing lots of postmaster zombies (190 and growing)...

While new connections are hanging, top shows the postmaster using
100% of CPU. SIGTERM and SIGQUIT have no effect. Here's a backtrace
of this busy postmaster:

(gdb) bt
#0 0x000000346f8c43a0 in __read_nocancel () from /lib64/libc.so.6
#1 0x000000346f86c747 in _IO_new_file_underflow () from /lib64/libc.so.6
#2 0x000000346f86d10e in _IO_default_uflow_internal () from /lib64/libc.so.6
#3 0x000000346f8689cb in getc () from /lib64/libc.so.6
#4 0x0000000000531ee8 in next_token (fp=0x10377ae0, buf=0x7fff32230e60 "", bufsz=4096) at hba.c:128
#5 0x0000000000532233 in tokenize_file (filename=0x10359b70 "global", file=0x10377ae0, lines=0x7fff322310f8, line_nums=0x7fff322310f0) at hba.c:232
#6 0x00000000005322e9 in tokenize_file (filename=0x2b1c8cbf5800 "global/pg_auth", file=0x103767a0, lines=0x98b168, line_nums=0x98b170) at hba.c:358
#7 0x00000000005327ff in load_role () at hba.c:959
#8 0x000000000057f878 in sigusr1_handler (postgres_signal_arg=<value optimized out>) at postmaster.c:3830
#9 <signal handler called>
#10 0x000000346f8cb323 in __select_nocancel () from /lib64/libc.so.6
#11 0x000000000057cc33 in ServerLoop () at postmaster.c:1236
#12 0x000000000057dfdf in PostmasterMain (argc=6, argv=0x1033f000) at postmaster.c:1031
#13 0x00000000005373de in main (argc=6, argv=<value optimized out>) at main.c:188

...and more from the server logs, fwiw:

2010-03-01 17:30:24.213 CST [32238] WARNING: worker took too long to start; cancelled
2010-03-01 17:30:31.250 CST [32236] DEBUG: transaction log switch forced (archive_timeout=300)
2010-03-01 17:31:24.216 CST [32238] WARNING: worker took too long to start; cancelled
2010-03-01 17:32:24.219 CST [32238] WARNING: worker took too long to start; cancelled
2010-03-01 17:33:24.222 CST [32238] WARNING: worker took too long to start; cancelled
2010-03-01 17:34:24.225 CST [32238] WARNING: worker took too long to start; cancelled
2010-03-01 17:35:19.061 CST [32236] LOG: checkpoint starting: time
2010-03-01 17:35:19.185 CST [32236] DEBUG: recycled transaction log file "000000010000001C00000071"
2010-03-01 17:35:19.185 CST [32236] LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=0.028 s, sync=0.000 s, total=0.124 s
2010-03-01 17:35:24.328 CST [32238] WARNING: worker took too long to start; cancelled
2010-03-01 17:35:31.224 CST [32236] DEBUG: transaction log switch forced (archive_timeout=300)
2010-03-01 17:36:44.332 CST [32238] WARNING: worker took too long to start; cancelled
2010-03-01 17:37:44.434 CST [32238] WARNING: worker took too long to start; cancelled
2010-03-01 17:37:47.378 CST [3692] dba 10....(42816) dba LOG: could not receive data from client: Connection timed out
2010-03-01 17:37:47.378 CST [3692] dba 10....(42816) dba LOG: unexpected EOF on client connection
2010-03-01 17:37:47.380 CST [3692] dba 10....(42816) dba LOG: disconnection: session time: 2:11:15.303 user=dba database=dba host=... port=428
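
The pg_auth layout described in the quoted message above (3 or more double-quoted items per line, each separated by a space) can be checked mechanically rather than by eye. The following is only a rough sketch of such a check, not anything PostgreSQL ships: the function name check_auth_lines and the regular expression are made up for illustration, and it assumes items contain no embedded double quotes and are separated by exactly one space. It tests only the layout as described in this message and does not reproduce what hba.c actually does when it tokenizes the file.

```python
import re

# One double-quoted item. Assumption: items contain no embedded double quotes.
ITEM = r'"[^"]*"'

# Assumed layout, taken from the description in this message:
# 3 or more quoted items, each separated by a single space.
LINE_RE = re.compile(rf'{ITEM}(?: {ITEM}){{2,}}')


def check_auth_lines(lines):
    """Return (line_number, text) for every line that does not fit the layout.

    Hypothetical helper for illustration; line numbers start at 1.
    """
    bad = []
    for n, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if not LINE_RE.fullmatch(text):
            bad.append((n, text))
    return bad


if __name__ == "__main__":
    # Made-up sample entries, not taken from any real pg_auth file.
    sample = [
        '"alice" "secret" "staff"\n',
        '"bob" "secret"\n',
    ]
    for n, text in check_auth_lines(sample):
        print(f"line {n} does not match the expected layout: {text}")
```

Run against a copy of $PGDATA/global/pg_auth (for example via open(path).readlines()), any line it reports would at least be worth a closer look, though a clean result does not rule out other problems with the file.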
