Re: Server hitting 100% CPU usage, system comes to a crawl.

From: Brian Fehrle <brianf(at)consistentstate(dot)com>
To: Scott Mead <scottm(at)openscg(dot)com>
Cc: pgsql-general General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Server hitting 100% CPU usage, system comes to a crawl.
Date: 2011-10-27 20:36:50
Message-ID: 4EA9C0E2.2030000@consistentstate.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 10/27/2011 02:27 PM, Scott Mead wrote:
>
>
> On Thu, Oct 27, 2011 at 2:39 PM, Brian Fehrle
> <brianf(at)consistentstate(dot)com <mailto:brianf(at)consistentstate(dot)com>> wrote:
>
> Hi all, need some help/clues on tracking down a performance issue.
>
> PostgreSQL version: 8.3.11
>
> I've got a system that has 32 cores and 128 gigs of ram. We have
> connection pooling set up, with about 100 - 200 persistent
> connections open to the database. Our applications then use these
> connections to query the database constantly, but when a
> connection isn't currently executing a query, it's <IDLE>. On
> average, at any given time, there are 3 - 6 connections that are
> actually executing a query, while the rest are <IDLE>.
>
>
> Remember, when you read pg_stat_activity, it is showing you query
> activity from that exact specific moment in time. Just because it
> looks like only 3-6 connections are executing, doesn't mean that 200
> aren't actually executing < .1ms statements. With such a beefy box, I
> would see if you can examine any stats from your connection pooler to
> find out how many connections are actually getting used.
>
Correct, we're getting a few hundred transactions per second, but under
normal operation, polling pg_stat_activity will show the average of 3 -
6 queries that were running at that moment, and those queries run for an
average of 5 - 7 seconds. So my belief is that something happens to the
system where either a) We get a ton more queries than normal from the
application (currently hunting down data to support this), or b) the
overall speed of the system slows down so that all queries increase in
time so much that polling pg_stat_activity lets me actually see them.

>
> About once a day, queries that normally take just a few seconds
> slow way down, and start to pile up, to the point where instead of
> just having 3-6 queries running at any given time, we get 100 -
> 200. The whole system comes to a crawl, and looking at top, the
> CPU usage is 99%.
>
> Looking at top, I see no SWAP usage, very little IOWait, and there
> are a large number of postmaster processes at 100% cpu usage
> (makes sense, at this point there are 150 or so queries currently
> executing on the database).
>
> Tasks: 713 total, 44 running, 668 sleeping, 0 stopped, 1 zombie
> Cpu(s): 4.4%us, 92.0%sy, 0.0%ni, 3.0%id, 0.0%wa, 0.0%hi,
> 0.3%si, 0.2%st
> Mem: 134217728k total, 131229972k used, 2987756k free, 462444k
> buffers
> Swap: 8388600k total, 296k used, 8388304k free, 119029580k
> cached
>
>
> In the past, we noticed that autovacuum was hitting some large
> tables at the same time this happened, so we turned autovacuum off
> to see if that was the issue, and it still happened without any
> vacuums running.
>
> That was my next question :)
>
>
> We also ruled out checkpoints being the cause.
>
> How exactly did you rule this out? Just because a checkpoint is over
> doesn't mean that it hasn't had a negative effect on the OS cache. If
> you're stuck going to disk, that could be hurting you (that being
> said, you do point to a low I/O wait above, so you're probably correct
> in ruling this out).
>
Checkpoint settings were set to the default per install. 5 minute
timeout, 0.5 completion target, and 30s warning. Looking at the logs, we
were getting a checkpoint every 5 minutes on the dot.

I looked at the data in pg_stat_database and noticed that buffers
written by checkpoints are near 4X that of the background writer. So I
implemented some changes to get more to be written by the background
writer, including increasing the checkpoint timeout to 30 minutes, and
setting the frequency of the bgwriter wait time from 200ms to 50ms.

checkpoints now happen 30 mins apart on the dot, and there was not a
checkpoint happening the last time this issue of major slowdown occured.

>
> I'm currently digging through some statistics I've been gathering
> to see if traffic increased at all, or remained the same when the
> slowdown occurred. I'm also digging through the logs from the
> postgresql cluster (I increased verbosity yesterday), looking for
> any clues. Any suggestions or clues on where to look for this to
> see what can be causing a slowdown like this would be greatly
> appreciated.
>
> Are you capturing table-level stats from pg_stat_user_[tables |
> indexes]? Just because a server doesn't look busy doesn't mean that
> you're not doing 1000 index scans per second returning 1000 tuples
> each time.
>
I am not grabbing any of those at the moment, I'll look into those.

- Brian F
> --Scott
>
> Thanks,
> - Brian F
>
> --
> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org
> <mailto:pgsql-general(at)postgresql(dot)org>)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>
>

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Tom Lane 2011-10-27 20:50:36 Re: Server hitting 100% CPU usage, system comes to a crawl.
Previous Message Adrian Schreyer 2011-10-27 20:31:41 Custom data type in C with one fixed and one variable attribute