Re: FW: Queries becoming slow under heavy load

From: "Anne Rosset" <arosset(at)collab(dot)net>
To: <pgsql-performance(at)postgresql(dot)org>
Subject: Re: FW: Queries becoming slow under heavy load
Date: 2011-01-28 05:12:36
Message-ID: 945629628BB0174D86709AFE6D1CDEF501634373@SP-EXCHMBC.sp.corp.collab.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Scott,
Thanks for your response.
We are over NFS for our storage ...

Here is what we see during our performance testing:
This is about 7 seconds after the query was sent to postgres:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7090 root 25 0 689m 399m 10m R 89.9 5.0 3868:44 java
1846 postgres 16 0 474m 198m 103m R 75.2 2.5 0:28.69 postmaster
2170 postgres 15 0 391m 203m 188m R 44.0 2.6 0:17.63 postmaster
2555 httpd 18 0 298m 15m 4808 R 22.0 0.2 0:00.12 httpd
2558 root 15 0 29056 2324 1424 R 1.8 0.0 0:00.01 top
1207 httpd 15 0 337m 20m 7064 R 0.0 0.3 0:00.69 httpd
28312 postgres 16 0 396m 183m 162m D 0.0 2.3 0:50.82 postmaster <---- this is the query here

Notice the 0% CPU, also, notice the 183m RES memory.

Ten seconds later:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7090 root 25 0 689m 399m 10m R 92.9 5.0 3868:53 java
2657 root 15 0 29056 2328 1424 R 1.9 0.0 0:00.01 top
28312 postgres 16 0 396m 184m 162m D 0.0 2.3 0:50.84 postmaster <---- here

Ten seconds after that:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7090 root 25 0 689m 399m 10m R 88.7 5.0 3869:02 java
1845 postgres 16 0 473m 223m 127m D 22.6 2.8 0:26.39 postmaster
2412 httpd 15 0 2245m 1.4g 16m R 18.9 17.8 0:02.48 java
966 postgres 15 0 395m 242m 221m D 0.0 3.0 1:02.31 postmaster
2680 root 15 0 29056 2336 1424 R 0.0 0.0 0:00.01 top
28312 postgres 16 0 396m 184m 163m D 0.0 2.3 0:50.85 postmaster <--- here

etc....

and it's not until around the 221 second mark that we see catch it consuming CPU:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7090 root 25 0 689m 399m 10m R 93.4 5.0 3872:07 java
28312 postgres 16 0 396m 225m 204m R 5.7 2.8 0:51.52 postmaster <----- here
3391 root 15 0 29056 2348 1424 R 1.9 0.0 0:00.01 top
4297 root 16 0 10228 740 632 D 0.0 0.0 12:53.66 hald-addon-stor
26885 httpd 15 0 2263m 1.5g 16m R 0.0 19.0 0:00.01 java

Note that the load average is fine during this timeframe, ~4 out of 8, so plenty of CPU.

Looks like this is true "halting".

Further, or worse yet, this same behavior expands out to multiple processes, producing a true "back up". It can look
something like this. Notice the 0% cpu consumption:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7090 root 22 0 689m 399m 10m R 91.1 5.0 3874:32 java
4139 root 15 0 29080 2344 1424 R 1.9 0.0 0:00.01 top
1555 postgres 16 0 474m 258m 162m D 0.0 3.2 0:17.32 postmaster
1846 postgres 16 0 474m 285m 189m D 0.0 3.6 0:47.43 postmaster
2713 postgres 16 0 404m 202m 179m D 0.0 2.5 0:33.54 postmaster
2801 postgres 16 0 391m 146m 131m D 0.0 1.8 0:04.48 postmaster
2804 postgres 16 0 419m 172m 133m D 0.0 2.2 0:09.41 postmaster
2825 postgres 16 0 473m 142m 49m D 0.0 1.8 0:04.12 postmaster

Thanks for any additional explanation/advice,
Anne

-----Original Message-----
From: Scott Marlowe [mailto:scott(dot)marlowe(at)gmail(dot)com]
Sent: Wednesday, January 26, 2011 8:19 PM
To: Anne Rosset
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: FW: [PERFORM] Queries becoming slow under heavy load

On Wed, Jan 26, 2011 at 9:04 AM, Anne Rosset <arosset(at)collab(dot)net> wrote:

<HUGE LIST OF SETTINGS DELETED>

PLEASE post just the settings you changed. I'm not searching through a list that big for the interesting bits.

> Today we did more analysis and observed  postgress processes that
> continually reported status 'D' in top.

Full stop. The most likely problem here is that the query is now hitting the disks and waiting. If you have 1 disk and two users, the access speed will drop by factors, usually much higher than 2.

To put it very simply, you need as many mirror pairs in your RAID-10 or as many disks in your RAID5 or RAID 6 as you have users reading the disk drives. If you're writing you need more and more disks too.
Mediating this issue we find things like SSD cache in ZFS or battery backed RAID controllers. They allow the reads and writes to be streamlined quite a bit to the spinning disks, making it appear the RAID array underneath it was much faster, had better access, and all
the sectors were near each other. To an extent.

If you have the answer to the previous poster's question "can you tell us what sort of IO you have (sata, scsi, raid, # of disks, etc)." you should provide it. If you've got a pair of 5k RPM SATA drives in a
RAID-1 you might need more hardware.

So, instead of just settings, show us a few carefully selected lines of output from vmstat or iostat while this is happening. Don't tell us what you see, show us.

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Michael Kohl 2011-01-28 08:46:29 Re: High load,
Previous Message david 2011-01-28 00:53:00 Re: How to best use 32 15k.7 300GB drives?