Server hitting 100% CPU usage, system comes to a crawl.

From: Brian Fehrle <brianf(at)consistentstate(dot)com>
To: pgsql-general General <pgsql-general(at)postgresql(dot)org>
Subject: Server hitting 100% CPU usage, system comes to a crawl.
Date: 2011-10-27 18:39:30
Message-ID: 4EA9A562.7020808@consistentstate.com
Lists: pgsql-general

Hi all, need some help/clues on tracking down a performance issue.

PostgreSQL version: 8.3.11

I've got a system with 32 cores and 128 GB of RAM. We have
connection pooling set up, with about 100 - 200 persistent connections
open to the database. Our applications then use these connections to
query the database constantly, but when a connection isn't currently
executing a query, it's <IDLE>. On average, at any given time, there are
3 - 6 connections that are actually executing a query, while the rest
are <IDLE>.
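
A minimal sketch of how the active-versus-idle split described above could be tallied, assuming rows sampled from the pg_stat_activity view on 8.3, where an idle backend reports its current_query as '<IDLE>'. The function name tally_backends and the sample rows are hypothetical, for illustration only.

```python
# Query one could run on 8.3 to sample backend states (assumption: the
# 8.3-era pg_stat_activity columns procpid and current_query).
SAMPLE_SQL = "SELECT procpid, current_query FROM pg_stat_activity"


def tally_backends(rows):
    """Count idle, idle-in-transaction, and executing backends.

    rows is an iterable of (procpid, current_query) pairs.
    Anything that is not one of the two idle markers is counted as active.
    """
    counts = {"idle": 0, "idle_in_transaction": 0, "active": 0}
    for _procpid, current_query in rows:
        text = (current_query or "").strip()
        if text == "<IDLE>":
            counts["idle"] += 1
        elif text == "<IDLE> in transaction":
            counts["idle_in_transaction"] += 1
        else:
            counts["active"] += 1
    return counts


# Hypothetical sample, not taken from the real system.
sample_rows = [
    (101, "<IDLE>"),
    (102, "SELECT 1"),
    (103, "<IDLE> in transaction"),
    (104, "<IDLE>"),
]
print(tally_backends(sample_rows))
```

Sampling this at a fixed interval and logging the counts would show how quickly the number of executing backends climbs from the usual 3 - 6 when a slowdown starts.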

About once a day, queries that normally take just a few seconds slow way
down, and start to pile up, to the point where instead of just having
3-6 queries running at any given time, we get 100 - 200. The whole
system comes to a crawl, and looking at top, the CPU usage is 99%.

Looking at top, I see no swap usage, very little I/O wait, and a large
number of postmaster processes at 100% CPU usage (makes sense, at
this point there are 150 or so queries currently executing on the database).

Tasks: 713 total, 44 running, 668 sleeping, 0 stopped, 1 zombie
Cpu(s): 4.4%us, 92.0%sy, 0.0%ni, 3.0%id, 0.0%wa, 0.0%hi, 0.3%si, 0.2%st
Mem: 134217728k total, 131229972k used, 2987756k free, 462444k buffers
Swap: 8388600k total, 296k used, 8388304k free, 119029580k cached
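
To make the CPU breakdown in the snapshot above easier to read, here is a small sketch that splits the top "Cpu(s):" summary line into its fields. The function name parse_top_cpu is hypothetical; the input is the line from this snapshot. It shows that most of the busy time is in the sy (system) column rather than us (user), which may help narrow down where to look.

```python
import re


def parse_top_cpu(line):
    """Parse a top 'Cpu(s):' summary line into {field: percent}."""
    # Each field looks like '92.0%sy': a number, a percent sign, a short name.
    return {name: float(value) for value, name in re.findall(r"([\d.]+)%(\w+)", line)}


cpu_line = "Cpu(s): 4.4%us, 92.0%sy, 0.0%ni, 3.0%id, 0.0%wa, 0.0%hi, 0.3%si, 0.2%st"
cpu = parse_top_cpu(cpu_line)

# Treat everything that is not idle as busy.
busy = 100.0 - cpu["id"]
print("busy: %.1f%%, user: %.1f%%, system: %.1f%%" % (busy, cpu["us"], cpu["sy"]))
```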

In the past, we noticed that autovacuum was hitting some large tables at
the same time this happened, so we turned autovacuum off to see if that
was the issue, and it still happened without any vacuums running.

We also ruled out checkpoints being the cause.

I'm currently digging through some statistics I've been gathering to see
if traffic increased at all, or remained the same when the slowdown
occurred. I'm also digging through the logs from the postgresql cluster
(I increased verbosity yesterday), looking for any clues. Any
suggestions or clues on where to look for this to see what can be
causing a slowdown like this would be greatly appreciated.

Thanks,
- Brian F
