100% cpu usage on some postmaster processes kill the complete database

From: Paul Dunkler <paul(dot)dunkler(at)xyrality(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: 100% cpu usage on some postmaster processes kill the complete database
Date: 2012-03-01 16:41:08
Message-ID: 3887D5AB-5997-47C6-AB27-367FAEA90BAA@xyrality.com
Lists: pgsql-general

Hi List,

We are currently running a rather large PostgreSQL installation that handles approximately 4k transactions and 50k index scans per second.

Over the last few days, at irregular times (3-4 times a day), some of the postmaster processes have been running at 100% CPU usage. This leads to a total breakdown of query execution. We see tons of statements that are correctly aborted by our statement_timeout, which is set to 15 seconds. I have searched, but I cannot really tell what the problem could be...

Some things I have checked:
- We are not running any bulk jobs or maintenance scripts at these times
- There are no system errors in any logs during the slowdowns
- I/O performance seems fine, with no high I/O wait. However, I/O writes drop off completely during these periods, apparently because no postgres process can perform any update

I just installed a script that records the output of top and ps axf to help track down the problem. Here is a snippet of the top output:

> top - 15:55:02 up 59 days, 37 min, 1 user, load average: 35.95, 14.04, 7.32
> Tasks: 2417 total, 54 running, 2363 sleeping, 0 stopped, 0 zombie
> Cpu(s): 6.3%us, 1.0%sy, 0.0%ni, 90.2%id, 1.9%wa, 0.0%hi, 0.6%si, 0.0%st
> Mem: 264523700k total, 250145228k used, 14378472k free, 207032k buffers
> Swap: 2097144k total, 553624k used, 1543520k free, 166905748k cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 29852 postgres 20 0 131g 59m 35m R 100.0 0.0 1:27.71 postmaster
> 29854 postgres 20 0 131g 70m 45m R 100.0 0.0 1:35.43 postmaster
> 17449 postgres 20 0 131g 1.2g 1.2g R 100.0 0.5 1:52.62 postmaster
> 29868 postgres 20 0 131g 1.1g 1.0g R 100.0 0.4 1:58.93 postmaster
> 30136 postgres 20 0 131g 77m 52m R 100.0 0.0 1:34.33 postmaster
> 30294 postgres 20 0 131g 66m 41m R 100.0 0.0 1:33.33 postmaster
> 30864 postgres 20 0 131g 66m 41m R 100.0 0.0 1:36.17 postmaster
> 30872 postgres 20 0 131g 61m 36m R 100.0 0.0 1:26.81 postmaster
> 30876 postgres 20 0 131g 68m 43m R 100.0 0.0 1:33.97 postmaster
> 30899 postgres 20 0 131g 68m 44m R 100.0 0.0 1:38.95 postmaster
> 30906 postgres 20 0 131g 67m 42m R 100.0 0.0 1:27.82 postmaster
> 31173 postgres 20 0 131g 68m 44m R 100.0 0.0 1:28.49 postmaster
> 31239 postgres 20 0 131g 71m 46m R 100.0 0.0 1:31.42 postmaster
> 31248 postgres 20 0 131g 90m 65m R 100.0 0.0 1:26.20 postmaster
> 34934 postgres 20 0 131g 5580 3456 R 100.0 0.0 1:23.96 postmaster
> 47945 postgres 20 0 131g 3.0g 3.0g R 100.0 1.2 6:08.41 postmaster
> 16116 postgres 20 0 131g 84m 59m R 100.0 0.0 1:30.60 postmaster
> 16304 postgres 20 0 131g 85m 60m R 100.0 0.0 1:38.89 postmaster
> 17104 postgres 20 0 131g 96m 72m R 100.0 0.0 1:27.54 postmaster
> 17111 postgres 20 0 131g 98m 73m R 100.0 0.0 1:38.23 postmaster
> 17320 postgres 20 0 131g 98m 74m R 100.0 0.0 1:38.51 postmaster
> 31221 postgres 20 0 131g 63m 38m R 100.0 0.0 1:33.63 postmaster
> 31272 postgres 20 0 131g 1.0g 1.0g R 100.0 0.4 1:32.71 postmaster
> 3290 postgres 20 0 131g 99m 74m R 100.0 0.0 1:32.76 postmaster
> 3459 postgres 20 0 131g 2.1g 2.0g R 100.0 0.8 1:44.92 postmaster
> 16492 postgres 20 0 131g 100m 75m R 100.0 0.0 1:33.36 postmaster
> 16562 postgres 20 0 131g 114m 89m R 100.0 0.0 1:35.14 postmaster
> 17146 postgres 20 0 131g 91m 66m R 100.0 0.0 1:37.39 postmaster
> 17403 postgres 20 0 131g 98m 73m R 100.0 0.0 1:32.13 postmaster
> 31100 postgres 20 0 131g 62m 38m R 100.0 0.0 1:29.06 postmaster
> 2019 postgres 20 0 131g 1.2g 1.2g R 98.7 0.5 1:40.91 postmaster
> 2150 postgres 20 0 131g 1.3g 1.3g R 98.7 0.5 2:53.14 postmaster
> 16048 postgres 20 0 131g 71m 46m R 98.7 0.0 1:29.75 postmaster
> 30190 postgres 20 0 131g 1.4g 1.3g R 98.7 0.5 0:55.98 postmaster
> 16112 postgres 20 0 131g 862m 827m R 97.1 0.3 0:48.00 postmaster
> 31202 postgres 20 0 131g 74m 49m R 97.1 0.0 1:34.62 postmaster
> 35658 postgres 20 0 131g 5948 3788 R 97.1 0.0 0:12.29 postmaster
> 16134 postgres 20 0 131g 1.9g 1.9g R 95.4 0.8 1:47.27 postmaster
> 31034 postgres 20 0 131g 69m 44m R 95.4 0.0 1:26.35 postmaster
> 16120 postgres 20 0 131g 1.2g 1.2g R 93.8 0.5 2:04.02 postmaster
> 30891 postgres 20 0 131g 57m 33m R 93.8 0.0 1:23.08 postmaster
> 31261 postgres 20 0 131g 81m 56m R 93.8 0.0 1:24.51 postmaster
> 29790 postgres 20 0 131g 62m 37m R 92.2 0.0 1:35.34 postmaster
> 30426 postgres 20 0 131g 62m 37m R 87.4 0.0 1:34.51 postmaster
> 30857 postgres 20 0 131g 50m 26m R 79.3 0.0 1:37.82 postmaster
> 507 root 39 19 0 0 0 R 67.9 0.0 19:19.71 khugepaged
> 16095 postgres 20 0 131g 83m 58m R 67.9 0.0 1:27.64 postmaster
> 30856 postgres 20 0 131g 69m 44m R 67.9 0.0 1:34.46 postmaster
> 17442 postgres 20 0 131g 2.4g 2.4g S 11.3 0.9 1:02.14 postmaster
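
A minimal sketch of how a captured top snapshot like the one above could be filtered automatically, assuming it is taken in batch mode (e.g. `top -b -n 1`) and uses the same column layout. The function name `busy_processes`, the regular expression, and the 90% threshold are illustrative assumptions, not part of the script mentioned in this post:

```python
import re

# Assumed column order, as in the snapshot above:
# PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
ROW = re.compile(
    r"^\s*(?P<pid>\d+)\s+(?P<user>\S+)\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+"
    r"(?P<state>[A-Za-z])\s+(?P<cpu>\d+(?:\.\d+)?)\s+(?P<mem>\d+(?:\.\d+)?)\s+"
    r"(?P<time>\S+)\s+(?P<command>.+?)\s*$"
)


def busy_processes(top_output, command="postmaster", min_cpu=90.0):
    """Return (pid, cpu) pairs for running processes at or above min_cpu.

    Hypothetical helper: lines that do not look like process rows
    (summary header, column titles, blank lines) are skipped.
    """
    busy = []
    for line in top_output.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # not a process row
        if match["command"] != command or match["state"] != "R":
            continue  # different program, or not in the running state
        cpu = float(match["cpu"])
        if cpu >= min_cpu:
            busy.append((int(match["pid"]), cpu))
    return busy


if __name__ == "__main__":
    import sys

    # Usage (assumed): top -b -n 1 | python busy_postmasters.py
    for pid, cpu in busy_processes(sys.stdin.read()):
        print(pid, cpu)
```

The resulting PIDs could then be matched against the process column of pg_stat_activity to see which statements the busy backends are executing at that moment.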

PostgreSQL version information:
- PostgreSQL 9.1.2 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.5 20110214 (Red Hat 4.4.5-6), 64-bit
- Running Hot Replication to another node (same hardware setup there)

Server Hardware:
- 4x 12-core AMD Magny-Cours
- 256 GB of RAM (36% currently used)
- 1.3 TB SAS RAID (LSI RAID controller), 15k rpm

If I forgot to include any information that is important for analyzing my problem, please let me know. I did my best to describe the issue as accurately as I could.

--
Kind regards

Paul Dunkler
