Re: FW: Queries becoming slow under heavy load

From: "Anne Rosset" <arosset(at)collab(dot)net>
To: <pgsql-performance(at)postgresql(dot)org>
Subject: Re: FW: Queries becoming slow under heavy load
Date: 2011-02-03 17:40:02
Message-ID: 945629628BB0174D86709AFE6D1CDEF501634479@SP-EXCHMBC.sp.corp.collab.net
Lists: pgsql-performance

Thanks to all of you who replied and pointed to NFS as a potential
culprit.
Our issue was that pgsql's temp dir (pgsql_tmp) was set to the default
value ($PSQL_DIR/base/pgsql_tmp/), which was located on NFS.
Moving the temp dir to local disk got us a huge improvement.
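For anyone hitting the same problem, one common way to do the move is to
replace the pgsql_tmp directory with a symlink to local disk while the
server is stopped. A minimal sketch; the paths below are made up for
illustration, not the actual paths from this setup (on 8.3+ the
temp_tablespaces setting is another option):

```shell
# Sketch with illustrative paths; both are assumptions, not the real layout.
PGDATA=${PGDATA:-/tmp/pgdata_demo}        # hypothetical data directory (on NFS in our case)
LOCAL_TMP=${LOCAL_TMP:-/tmp/pg_local_tmp} # hypothetical directory on local, non-NFS disk

# With the server stopped, swap the temp dir for a symlink to local disk:
mkdir -p "$PGDATA/base" "$LOCAL_TMP"
rm -rf "$PGDATA/base/pgsql_tmp"
ln -s "$LOCAL_TMP" "$PGDATA/base/pgsql_tmp"
ls -ld "$PGDATA/base/pgsql_tmp"
```

After restarting, sorts and hashes that spill to disk land on local
storage instead of going over NFS.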

Anne

-----Original Message-----
From: Shaun Thomas [mailto:sthomas(at)peak6(dot)com]
Sent: Friday, January 28, 2011 7:31 AM
To: Anne Rosset
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: FW: [PERFORM] Queries becoming slow under heavy load

On 01/27/2011 11:12 PM, Anne Rosset wrote:

> Thanks for your response.
> We are over NFS for our storage ...

NFS? I'm not sure you know this, but NFS has major locking issues that
would make it a terrible candidate for hosting a database.

> and it's not until around the 221 second mark that we catch it
> consuming CPU:
>
>   PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
>  7090 root     25  0  689m 399m  10m R 93.4  5.0  3872:07 java
> 28312 postgres 16  0  396m 225m 204m R  5.7  2.8  0:51.52 postmaster  <----- here
>  3391 root     15  0 29056 2348 1424 R  1.9  0.0  0:00.01 top
>  4297 root     16  0 10228  740  632 D  0.0  0.0 12:53.66 hald-addon-stor
> 26885 httpd    15  0 2263m 1.5g  16m R  0.0 19.0  0:00.01 java
>
> Note that the load average is fine during this timeframe, ~4 out of 8,
> so plenty of CPU.

Please listen to us. We asked you to use sar, or iostat, to tell us how
much the disk IO is being utilized. From your other screenshots, there
were at least two other PG processes that were running and could have
been thrashing the disk or locking tables your "slow" query needed. If
it's waiting for disk IO, the CPU will remain low until it gets what it
needs.

Not everything is about the CPU. Especially now that we know your DB is
running on top of NFS.

> Further, or worse yet, this same behavior expands out to multiple
> processes, producing a true "back up". It can look something like
> this. Notice the 0% CPU consumption:
>
>  PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
> 7090 root     22  0  689m 399m  10m R 91.1  5.0 3874:32 java
> 4139 root     15  0 29080 2344 1424 R  1.9  0.0 0:00.01 top
> 1555 postgres 16  0  474m 258m 162m D  0.0  3.2 0:17.32 postmaster
> 1846 postgres 16  0  474m 285m 189m D  0.0  3.6 0:47.43 postmaster
> 2713 postgres 16  0  404m 202m 179m D  0.0  2.5 0:33.54 postmaster
> 2801 postgres 16  0  391m 146m 131m D  0.0  1.8 0:04.48 postmaster
> 2804 postgres 16  0  419m 172m 133m D  0.0  2.2 0:09.41 postmaster
> 2825 postgres 16  0  473m 142m  49m D  0.0  1.8 0:04.12 postmaster

Yes. And they could all be waiting for IO. Or NFS locking is blocking
the reads. Or... what is that Java app doing? We don't know the state of
your IO, and when you have 0% or very low CPU usage, you either have
locking contention or you're being IO starved.
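When backends sit in "D" state like this, it helps to capture what each
one is running at that moment. A hedged sketch against pg_stat_activity,
using the column names of the 8.x/9.0 releases current at the time
(later versions renamed procpid to pid and current_query to query):

```sql
-- Sketch: list non-idle backends and what they are running.
-- Column names are per PostgreSQL 8.x/9.0; adjust on newer releases.
SELECT procpid, usename, waiting, current_query
FROM pg_stat_activity
WHERE current_query <> '<IDLE>'
ORDER BY procpid;
```

A true value in the waiting column points at lock contention; queries
that are running but making no progress with idle CPUs point at IO.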

And what queries are these connections performing? You can check by
looking at the contents of the pg_stat_activity system view. If they're
selecting and still "slow", compare that against the iostat or sar
results. For instance, here's iostat output from our system:

iostat -dmx dm-9 1

Linux 2.6.18-92.el5 (oslchi6pedb1)  01/28/2011

Device:  rrqm/s  wrqm/s     r/s      w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await  svctm  %util
dm-9       0.00    0.00  125.46   227.78   4.95   0.89    33.88     0.08   0.19   0.08   2.91

Device:  rrqm/s  wrqm/s     r/s      w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await  svctm  %util
dm-9       0.00    0.00    5.00     0.00   0.04   0.00    14.40     0.05  10.60  10.60   5.30

Device:  rrqm/s  wrqm/s     r/s      w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await  svctm  %util
dm-9       0.00    0.00    2.00     0.00   0.02   0.00    16.00     0.01   7.00   7.00   1.40

Device:  rrqm/s  wrqm/s     r/s      w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await  svctm  %util
dm-9       0.00    0.00    4.00  1184.00   0.04   4.62     8.04    27.23  11.73   0.06   6.80

Device:  rrqm/s  wrqm/s     r/s      w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await  svctm  %util
dm-9       0.00    0.00   11.00   847.00   0.09   3.31     8.10    29.31  49.65   0.79  67.90

That last column, %util, effectively tells us how saturated the
controller is. If the percentage is high, it's really working hard to
supply the data we're asking for, or trying to write. If it's low, we're
probably working from memory cache, or getting fewer requests. There have
been times our queries are "slow" and when we check this stat, it's
often at or above 90%, sometimes for minutes at a time. That's almost
always a clear indicator you have IO contention. Queries can't work
without the data they need to return your results.
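As a quick illustration of reading that column programmatically, here is
a sketch that pulls the device name and %util out of an iostat device
line and flags likely saturation; the 90% threshold is a rough rule of
thumb from the paragraph above, not a hard limit:

```python
# Sketch: extract device and %util from `iostat -dmx` style device lines.
SATURATION_THRESHOLD = 90.0  # %util at or above this suggests IO contention


def parse_util(line):
    """Return (device, %util) from a single iostat device line."""
    fields = line.split()
    # Device name is the first field, %util is the last.
    return fields[0], float(fields[-1])


# Sample line taken from the iostat output above.
sample = "dm-9 0.00 0.00 11.00 847.00 0.09 3.31 8.10 29.31 49.65 0.79 67.90"
device, util = parse_util(sample)
print(device, util, util >= SATURATION_THRESHOLD)  # dm-9 67.9 False
```

Feeding it the full multi-sample output would show the last interval
(67.9% busy) creeping toward saturation while the earlier ones idle.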

Sending us more CPU charts isn't going to help us help you.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas(at)peak6(dot)com

______________________________________________

See http://www.peak6.com/email_disclaimer.php
for terms and conditions related to this email
