Re: rebellious pg stats collector (reopened case)

From: Laszlo Nagy <gandalf(at)shopzeus(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-admin(at)postgresql(dot)org
Subject: Re: rebellious pg stats collector (reopened case)
Date: 2008-12-19 08:54:36
Message-ID: 494B614C.5060808@shopzeus.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin pgsql-performance

Alvaro Herrera wrote:
> Laszlo Nagy wrote:
>
>> PostgreSQL 8.3.5, the system is now stable (uptime > 10 days).
>> PostgreSQL stats collector uses 100% CPU forever:
>>
>
> Could you grab a few stack traces from it and post them? Also possibly
> useful, leave strace running on the pgstat process for a while and post
> the output somewhere.
>
Very interesting results. Before I used 'truss' on the process, it was:

78816 pgsql 1 106 0 22400K 7100K CPU6 6 24.2H 99.02%
postgres

After I started "truss -p 78815" the first message I got was:

SIGNAL 17 (SIGSTOP)

and the it waited for some seconds. Then my screen was filled with
messages like this:

poll({7/POLLIN|POLLERR},1,2000) = 1 (0x1)
recvfrom(7,"\^A\0\0\0\M-X\^C\0\0\M^T\f\r\0\n"...,1000,0x0,NULL,0x0) =
984 (0x3d8)
poll({7/POLLIN|POLLERR},1,2000) = 1 (0x1)
recvfrom(7,"\^A\0\0\0\M-X\^C\0\0\M^T\f\r\0\n"...,1000,0x0,NULL,0x0) =
984 (0x3d8)
poll({7/POLLIN|POLLERR},1,2000) = 1 (0x1)
recvfrom(7,"\^A\0\0\0\M^X\^A\0\0\M^T\f\r\0"...,1000,0x0,NULL,0x0) = 408
(0x198)
poll({7/POLLIN|POLLERR},1,2000) = 1 (0x1)
recvfrom(7,"\b\0\0\0(at)\0\0\0\0\0\0\0\0\0\0\0"...,1000,0x0,NULL,0x0) = 64
(0x40)
poll({7/POLLIN|POLLERR},1,2000) = 1 (0x1)
recvfrom(7,"\b\0\0\0(at)\0\0\0\0\0\0\0\0\0\0\0"...,1000,0x0,NULL,0x0) = 64
(0x40)
poll({7/POLLIN|POLLERR},1,2000) = 1 (0x1)
recvfrom(7,"\^A\0\0\0\M-X\0\0\0\M^O\v\b\0\^B"...,1000,0x0,NULL,0x0) =
216 (0xd8)
poll({7/POLLIN|POLLERR},1,2000) = 1 (0x1)
recvfrom(7,"\^A\0\0\0\M-x\^A\0\0\0\0\0\0\^E"...,1000,0x0,NULL,0x0) = 504
(0x1f8)
poll({7/POLLIN|POLLERR},1,2000) = 1 (0x1)
recvfrom(7,"\^A\0\0\0\M-X\0\0\0\M^T\f\r\0\^B"...,1000,0x0,NULL,0x0) =
216 (0xd8)
poll({7/POLLIN|POLLERR},1,2000) = 1 (0x1)
recvfrom(7,"\^A\0\0\0\M-x\^A\0\0\0\0\0\0\^E"...,1000,0x0,NULL,0x0) = 504
(0x1f8)
poll({7/POLLIN|POLLERR},1,2000) = 1 (0x1)
recvfrom(7,"\^A\0\0\0\M-X\0\0\0\M^T\f\r\0\^B"...,1000,0x0,NULL,0x0) =
216 (0xd8)
poll({7/POLLIN|POLLERR},1,2000) = 1 (0x1)
recvfrom(7,"\^A\0\0\0\M-x\^A\0\0\0\0\0\0\^E"...,1000,0x0,NULL,0x0) = 504
(0x1f8)
poll({7/POLLIN|POLLERR},1,2000) = 1 (0x1)
recvfrom(7,"\b\0\0\0(at)\0\0\0\0\0\0\0\0\0\0\0"...,1000,0x0,NULL,0x0) = 64
(0x40)
poll({7/POLLIN|POLLERR},1,2000) = 1 (0x1)

I terminated the debugger with Ctrl+C. Then I started it again:

#truss -p 78815
SIGNAL 17 (SIGSTOP)
poll({7/POLLIN|POLLERR},1,2000) = 0 (0x0)
getppid(0x1,0x1,0x7d0,0x8014606cc,0xffffffff80a579c0,0x7fffffffceb8) =
95860 (0x17674)
open("global/pgstat.tmp",O_WRONLY|O_CREAT|O_TRUNC,0666) = 4 (0x4)
fstat(4,{mode=-rw------- ,inode=73223196,size=0,blksize=4096}) = 0 (0x0)
write(4,"\M^W\M-<\M-%\^A\^F\n\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\M-Z\^C\0\0\0\0\0\0\M-d\M^E\M-0u"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0+\0\0\0\0\0\0\0\^N\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\^C\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0T\^T\^N\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\^R\M-(\M^[\v\M-^\M-0A\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\^N\0\0\0\0\0\0\0\\n\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0T\M-A\n\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0~\M-d\M-fp\n\M-^\M-0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\^E\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"$\0\0\0\0\0\0+\^V\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0Z\M-F%"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0|\^B\0\0\0\0\0\0<\M-I\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0T\M-f\t\^V\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\M-H'\M-4\M-N\n\M-^\M-0A"...,4096) = 4096 (0x1000)
write(4,"\0\0\^N\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\b\0\0\0\0\0L\^B\0\0\0\0\0\0\M^^"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0dD\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\09\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0>\M-h\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0T%;\r\0\0\0\0\0 \0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\M-I\\j\n\M-^\M-0A\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0Y\240\^A\0\0\0\0\0\^C\M-3"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\^N\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\^A\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0P\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\^A\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\^P\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\M-P\^R\0\0\0"...,4096) = 4096 (0x1000)
write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,3119) = 3119 (0xc2f)
close(4) = 0 (0x0)
rename("global/pgstat.tmp","global/pgstat.stat") = 0 (0x0)
process exit, rval = 0

I did NOT send the SIGSTOP signal to the process. The pgstat collector
was running at 100%CPU for 24 hours, when I attached the debugger, and
attaching the debugger somehow caused the process to get a SIGSTOP, but
the process did not exit. When I started the debugger for the second
time, it got a SIGSTOP again, and this made the collector exiting the
endless loop (instead of stopping it).

What is the next step?

Laszlo

In response to

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message Alvaro Herrera 2008-12-19 13:04:46 Re: rebellious pg stats collector (reopened case)
Previous Message Alvaro Herrera 2008-12-19 03:09:59 Re: reindexdb dying with SIGPIPE on 8.2.5

Browse pgsql-performance by date

  From Date Subject
Next Message Alvaro Herrera 2008-12-19 13:04:46 Re: rebellious pg stats collector (reopened case)
Previous Message sathiya psql 2008-12-19 08:00:36 Re: pg_restore : out of memory