Random performance hit, unknown cause.

From: Brian Fehrle <brianf(at)consistentstate(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Random performance hit, unknown cause.
Date: 2012-04-12 18:41:08
Message-ID: 4F8721C4.8090300@consistentstate.com
Lists: pgsql-performance

Hi all,

OS: Linux 64 bit 2.6.32
PostgreSQL 9.0.5 installed from Ubuntu packages.
8 CPU cores
64 GB system memory
Database cluster is on raid 10 direct attached drive, using a HP p800
controller card.

I have a system that has been having occasional performance hits, where
the load on the system skyrockets, all queries take longer to execute
and a hot standby slave I have set up via streaming replication starts
to get behind. I'm having trouble pinpointing the exact issue.

This morning, during our nightly backup process (where we grab a copy of
the data directory), we started having this same issue. The main thing
that I see in all of these is a high disk wait on the system. When we
are performing 'well', the %wa from top is usually around 30%, and our
load is around 12 - 15. This morning we saw a load of 21 - 23 and a %wa
jumping between 60% and 75%.

The top process at pretty much all times is the WAL sender process; is
this normal?
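One way to confirm whether the standby is actually falling behind is to compare xlog positions on the primary and the standby. 9.0 predates the pg_stat_replication view (added in 9.1), but the location functions are available. A minimal sketch, assuming bash and placeholder hostnames; the 0xFF000000 multiplier is the approximation commonly used by pre-9.3 monitoring scripts (the last 16MB segment of each 4GB logical xlog file is skipped):

```shell
# Convert an xlog location like '2/E4C7A000' into an absolute byte
# offset.  0xFF000000 bytes per logical file is the usual pre-9.3
# approximation -- hedge: verify against your monitoring tool of choice.
xlog_to_bytes() {
    hi=${1%/*}
    lo=${1#*/}
    echo $(( 0x$hi * 0xFF000000 + 0x$lo ))
}

# Usage sketch (hostnames are placeholders -- adjust to your setup):
#   primary=$(psql -h primary-host -Atc "SELECT pg_current_xlog_location()")
#   standby=$(psql -h standby-host -Atc "SELECT pg_last_xlog_receive_location()")
#   echo "lag: $(( $(xlog_to_bytes "$primary") - $(xlog_to_bytes "$standby") )) bytes"
echo "example offset: $(xlog_to_bytes '1/10')"
```

Run periodically, a growing difference means the standby is not keeping up even if the WAL sender looks busy.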

From what I can tell, my access patterns on the database have not
changed: the same average number of inserts, updates, and deletes, and
nothing on the system has changed in any way. There are no autovacuum
processes beyond the ones that are normally already running.

So what can I do to track down what the issue is? Currently the
system has returned to a 'good' state, and performance looks great. But
I would like to know how to prevent this, as well as be able to grab
good stats if it does happen again in the future.
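For grabbing good stats the next time it happens, a small snapshot script (run by hand, or from cron when load crosses a threshold) can capture the moment. A sketch, assuming iostat, vmstat, and top are installed, psql can connect as the invoking user, and the output path is arbitrary; the pg_stat_activity column names are the 9.0 ones (procpid/current_query were renamed in 9.2):

```shell
# Hypothetical capture script: dump system and Postgres state into a
# timestamped directory so a bad spell can be analyzed after the fact.
out=/tmp/perfsnap-$(date +%Y%m%d-%H%M%S)
mkdir -p "$out"

# System-level samples, run in parallel.
iostat -d -x 5 3 > "$out/iostat.txt" 2>&1 &
vmstat 5 3       > "$out/vmstat.txt" 2>&1 &
top -b -n 1      > "$out/top.txt"    2>&1 &

# What the backends are doing right now (9.0 column names).
psql -Atc "SELECT procpid, waiting, now() - query_start AS runtime,
                  current_query
           FROM pg_stat_activity
           ORDER BY query_start" > "$out/activity.txt" 2>&1
wait
echo "snapshot written to $out"
```

A few of these taken during a bad spell, plus one from a 'good' period for baseline, usually narrows things down quickly.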

Has anyone had any issues with the HP p800 controller card in a postgres
environment? Is there anything that can help us maximise disk
performance in this case, as it seems to be one of our major
bottlenecks? I do plan on moving the pg_xlog to a separate drive down
the road; the cluster is extremely active, so that will help out a ton.
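For what it's worth, the usual way to relocate pg_xlog on 9.0 is a symlink created while the server is stopped. A sketch with placeholder paths:

```shell
# move_xlog DATA_DIR NEW_DIR: relocate pg_xlog onto another spindle via
# symlink (the standard 9.0-era approach).  The server MUST be stopped
# before running this, or the cluster will be corrupted.
move_xlog() {
    data=$1
    new=$2
    mv "$data/pg_xlog" "$new" &&
    ln -s "$new" "$data/pg_xlog"
}

# e.g. (server stopped; both paths are placeholders):
#   move_xlog /var/lib/postgresql/9.0/main /mnt/fastdisk/pg_xlog
```

After restarting, all WAL writes land on the dedicated drive, which takes the sequential fsync traffic off the data-file array.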

some IO stats:

$ iostat -d -x 5 3
Device:  rrqm/s  wrqm/s     r/s     w/s    rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
dev1       1.99   75.24  651.06  438.04  41668.57  8848.18    46.38     0.60   3.68   0.70  76.36
dev2       0.00    0.00  653.05  513.43  41668.57  8848.18    43.31     2.18   4.78   0.65  76.35

Device:  rrqm/s  wrqm/s     r/s     w/s    rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
dev1       0.00   35.20  676.20  292.00  35105.60  5688.00    42.13    67.76  70.73   1.03 100.00
dev2       0.00    0.00  671.80  295.40  35273.60  4843.20    41.48    73.41  76.62   1.03 100.00

Device:  rrqm/s  wrqm/s     r/s     w/s    rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
dev1       1.20   40.80  865.40  424.80  51355.20  8231.00    46.18    37.87  29.22   0.77  99.80
dev2       0.00    0.00  867.40  465.60  51041.60  8231.00    44.47    38.28  28.58   0.75  99.80
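Reading those samples side by side: the first (good) one shows await under 5 ms at ~76% util, while the later ones sit at 100% util with await of 29-77 ms and avgqu-sz in the 30s-70s, i.e. the array is saturated and requests are queueing. A throwaway filter over raw `iostat -d -x` device lines makes that easy to spot (the 99%/20 ms thresholds are arbitrary, and field positions vary between iostat versions):

```shell
# Flag devices that look saturated in `iostat -d -x` output.
# On this iostat's layout, field 10 is await (ms) and field 12 is %util.
iostat_flag() {
    awk '$12+0 >= 99 && $10+0 > 20 {
        printf "%s saturated: await=%sms util=%s%%\n", $1, $10, $12
    }'
}

# Usage: iostat -d -x 5 3 | iostat_flag
```

Piping the snapshot files from a bad spell through this gives a quick list of which devices were pegged and for how long.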

Thanks in advance,
Brian F
