Re: 8.3.9 - latency spikes with Linux (and tuning for consistently low latency)

From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Marinos Yannikos <mjy(at)geizhals(dot)at>
Cc: pgsql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: 8.3.9 - latency spikes with Linux (and tuning for consistently low latency)
Date: 2010-04-15 22:45:03
Message-ID: 4BC796EF.5030902@2ndquadrant.com
Lists: pgsql-performance

Marinos Yannikos wrote:
> vm.dirty_ratio = 80

This is tuned in the opposite direction from what you want.  The default 
tuning in the generation of kernels you're using is:

/proc/sys/vm/dirty_ratio = 10
/proc/sys/vm/dirty_background_ratio = 5

And those should be considered upper limits if you want to tune for latency.

Unfortunately, even 5% will still allow 1.6GB of dirty data to queue up 
without being written given 32GB of RAM, which is still plenty to lead 
to a multi-second pause at times.
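
As a rough illustration of that arithmetic, here is a minimal sketch in 
Python.  The 50 MiB/s sustained write rate is purely a hypothetical 
figure, not something measured on this system, and the kernel really 
applies these ratios to "dirtyable" memory rather than total RAM, so 
actual thresholds will be somewhat lower than what this prints.

```python
# Back-of-the-envelope estimate of how much dirty data the vm.dirty_*
# ratios allow, and how long it takes to drain at a given write rate.
# Simplification: ratios are applied to total RAM here; the kernel uses
# "dirtyable" memory, which is a bit smaller.

GIB = 1024 ** 3
MIB = 1024 ** 2


def dirty_limit_bytes(ram_bytes: int, ratio_percent: float) -> float:
    """Approximate bytes of dirty page cache allowed at a given ratio."""
    return ram_bytes * ratio_percent / 100.0


def flush_seconds(dirty_bytes: float, write_bytes_per_sec: float) -> float:
    """Seconds needed to write out dirty_bytes at a sustained rate."""
    return dirty_bytes / write_bytes_per_sec


ram = 32 * GIB
assumed_write_rate = 50 * MIB  # hypothetical figure, not a measurement

for name, ratio in [
    ("dirty_background_ratio", 5),
    ("dirty_ratio", 10),
    ("dirty_ratio", 80),
]:
    limit = dirty_limit_bytes(ram, ratio)
    seconds = flush_seconds(limit, assumed_write_rate)
    print(f"{name}={ratio}: ~{limit / GIB:.1f} GiB dirty, ~{seconds:.0f} s to flush")
```

Substituting a write rate measured on the actual volume gives a more 
useful number; the point is only that the backlog scales directly with 
RAM size and the configured ratio.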

> 3 DB clusters, 2 of which are actively used, all on the same 
> [software] RAID-1 FS

So your basic problem here is that you don't have enough disk I/O to 
support this load.  You can tune it all day and that fundamental issue 
will never go away.  You'd need a battery-backed write controller 
capable of hardware RAID to even have a shot at supporting a system with 
this much RAM without long latency pauses.  I'd normally break out the 
WAL onto a separate volume too.

> [nothing for a few minutes]
> 2010-04-15 16:50:03 CEST LOG:  duration: 8995.934 ms  statement: 
> select ...
> 2010-04-15 16:50:04 CEST LOG:  duration: 3383.780 ms  statement: 
> select ...
> 2010-04-15 16:50:04 CEST LOG:  duration: 3328.523 ms  statement: 
> select ...
> 2010-04-15 16:50:05 CEST LOG:  duration: 1120.108 ms  statement: 
> select ...
> 2010-04-15 16:50:05 CEST LOG:  duration: 1079.879 ms  statement: 
> select ...
> [nothing for a few minutes]

Guessing five minutes each time?  You should turn on log_checkpoints to 
be sure, but I'd bet money that's the interval, and that these are 
checkpoint spikes.  If the checkpoint log entry shows up at about the 
same time as all these queries that were blocking behind it, that's what 
you've got.
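
One way to check the interval before enabling checkpoint logging is to 
measure the spacing between bursts of slow statements in the existing 
log.  The following Python sketch assumes the log line layout quoted 
above (timestamp, time zone, "LOG:  duration: ... ms"); the function 
names and the 60-second quiet gap are illustrative choices, and the last 
sample line is made up for the demonstration rather than taken from the 
original log.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as:
# 2010-04-15 16:50:03 CEST LOG:  duration: 8995.934 ms  statement: select ...
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \S+ LOG:\s+duration: ([\d.]+) ms"
)


def slow_statements(lines, min_ms=1000.0):
    """Yield (timestamp, duration_ms) for statements at least min_ms long."""
    for line in lines:
        m = LINE_RE.match(line)
        if m and float(m.group(2)) >= min_ms:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            yield ts, float(m.group(2))


def burst_starts(events, gap=timedelta(seconds=60)):
    """Return the first timestamp of each burst; a burst ends after a quiet gap."""
    starts, last = [], None
    for ts, _duration in events:
        if last is None or ts - last > gap:
            starts.append(ts)
        last = ts
    return starts


def burst_intervals(starts):
    """Time between the starts of consecutive bursts."""
    return [b - a for a, b in zip(starts, starts[1:])]


sample = [
    "2010-04-15 16:50:03 CEST LOG:  duration: 8995.934 ms  statement: select ...",
    "2010-04-15 16:50:04 CEST LOG:  duration: 3383.780 ms  statement: select ...",
    "2010-04-15 16:50:05 CEST LOG:  duration: 1120.108 ms  statement: select ...",
    # Made-up line, only here so the example has a second burst to measure:
    "2010-04-15 16:55:03 CEST LOG:  duration: 2000.000 ms  statement: select ...",
]

for interval in burst_intervals(burst_starts(slow_statements(sample))):
    print(interval)
```

If the intervals on the real log cluster around the configured 
checkpoint_timeout, that supports the checkpoint explanation; the 
checkpoint log entries themselves remain the definitive evidence.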

> shared_buffers=5GB (database size is ~4.7GB on disk right now)

The best shot you have at making this problem a little better just with 
software tuning is to reduce this to something much smaller; 128MB - 
256MB would be my starting suggestion.  Make sure checkpoint_segments is 
still set to a high value.

The other thing you could try is to tune like this:

checkpoint_segments=256
checkpoint_timeout=20min

Which would get you 4X as much checkpoint spreading as you have now.
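
The 4X figure can be checked with a little arithmetic.  This sketch 
assumes the stock 8.3 defaults are what is in effect now 
(checkpoint_timeout=5min, checkpoint_completion_target=0.5), which the 
original post does not state, and that checkpoints are being triggered 
by the timeout rather than by running out of segments.

```python
def spread_window_seconds(checkpoint_timeout_s: float,
                          completion_target: float = 0.5) -> float:
    """Approximate time over which a timeout-triggered checkpoint spreads
    its writes: the timeout multiplied by checkpoint_completion_target."""
    return checkpoint_timeout_s * completion_target


current = spread_window_seconds(5 * 60)    # assumed defaults: 5min
proposed = spread_window_seconds(20 * 60)  # suggested: 20min

print(f"current:  {current:.0f} s")
print(f"proposed: {proposed:.0f} s")
print(f"ratio:    {proposed / current:.0f}x")
```

The same amount of dirty data is written over a window four times as 
long, so the write rate the disks must sustain during a checkpoint drops 
accordingly, provided checkpoint_segments is high enough that segments 
do not trigger the checkpoint first.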

> fsync=off

This is just generally a bad idea.

> work_mem=500MB
> wal_buffers=256MB (*)
> commit_delay=100000 (*)

That's way too big a value for work_mem; there's no sense making 
wal_buffers bigger than 16MB; and you shouldn't ever adjust 
commit_delay.  It's a mostly broken feature that might even introduce 
latency issues in your situation.  None of these are likely related to 
your problem today though.

> I am suspecting some strange software RAID or kernel problem, unless 
> the default bgwriter settings can actually cause selects to get stuck 
> for so long when there are too many dirty buffers (I hope not).

This is fairly simple:  your kernel is configured to allow the system to 
cache hundreds of megabytes, if not gigabytes, of writes.  There is no 
way to make that go completely away because the Linux kernel's write 
caching design works against consistently low latency.  I've written two 
papers in this area:

http://www.westnet.com/~gsmith/content/linux-pdflush.htm
http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm

And I doubt I could get the worst case on these tuned down to under a 
second using software RAID without a proper disk controller.  
Periodically, the database must get everything in RAM flushed out to 
disk, and the only way to make that happen instantly is for there to be 
a hardware write cache to dump it into, and the most common way to get 
one of those is to buy a hardware RAID card.

> Unless I'm missing something, I only have a non-RAID setup or ramdisks 
> (tmpfs), or SSDs left to try to get rid of these

Battery-backed write caching controller, and then re-tune afterwards.  
Nothing else will improve your situation very much.  SSDs have their own 
issues under heavy writes, and the RAID has nothing to do with your 
problem.  If this is disposable data and you can run from a RAM disk, 
that would work, but then you've got some serious work to do in order 
to make it persistent.

-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQuadrant(dot)com   www.2ndQuadrant.us