Re: Effects of setting linux block device readahead size

From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Scott Carey <scott(at)richrelevance(dot)com>
Cc: Mark Wong <markwkm(at)gmail(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>, Gabrielle Roth <gorthx(at)gmail(dot)com>, Selena Deckelmann <selenamarie(at)gmail(dot)com>
Subject: Re: Effects of setting linux block device readahead size
Date: 2008-09-11 04:12:58
Message-ID: Pine.GSO.4.64.0809101951190.18076@westnet.com
Lists: pgsql-performance
On Wed, 10 Sep 2008, Scott Carey wrote:

> Ok, so this is a drive level parameter that affects the data going into the
> disk cache?  Or does it also get pulled over the SATA/SAS link into the OS
> page cache?

It's at the disk block driver level in Linux, so I believe that's all 
going into the OS page cache.  They've been rewriting that section a bit 
and I haven't checked it since that change (see below).

> Additionally, I would like to know how this works with hardware RAID -- Does
> it set this value per disk?

Hardware RAID controllers usually have their own read-ahead policies that 
may or may not impact whether the OS-level read-ahead is helpful.  Since 
Mark's tests are going straight into the RAID controller, that's why it's 
helpful here, and why many people don't ever have to adjust this 
parameter.  For example, it doesn't give a dramatic gain on my Areca card 
even in JBOD mode, because that thing has its own cache to manage with its 
own agenda.

Once you start fiddling with RAID stripe sizes as well the complexity 
explodes, and next thing you know you're busy moving the partition table 
around to make the logical sectors line up with the stripes better and 
similar exciting work.

> Additionally, the O/S should have a good heuristic based read-ahead process
> that should make the drive/device level read-ahead much less important.  I
> don't know how long it's going to take for Linux to do this right:
> http://archives.postgresql.org/pgsql-performance/2006-04/msg00491.php
> http://kerneltrap.org/node/6642

That was committed in 2.6.23:

http://kernelnewbies.org/Linux_2_6_23#head-102af265937262a7a21766ae58fddc1a29a5d8d7

but clearly a larger minimum hint still helps, as the system whose 
benchmarks we've been staring at has that feature.

> Some chunk of data in that seek is free, afterwards it is surely not...

You can do a basic model of the drive to get a ballpark estimate on these 
things like I threw out, but trying to break down every little bit gets 
hairy.  In most estimation cases you see, where 128kB is the amount being 
read, the actual read time is so small compared to the rest of the numbers 
that it just gets ignored.

I was actually being optimistic about how much cache can get filled by 
seeks.  If the disk is spinning at 15000RPM, that's 4ms to do a full 
rotation.  That means that on average you'll also wait 2ms just to get the 
heads lined up to read that one sector on top of the 4ms seek to get in 
the area; now we're at 6ms before you've read anything, topping seeks out 
at under 167/second.  That number--average seek time plus half a 
rotation--is what a lot of people use to compute the IOPS for the drive.  
In that calculation, the time spent actually reading data once you've gone 
through all that typically doesn't factor in.  IOPS is not very well 
defined; some people *do* include the reading time once you're there, which 
is one reason I don't like to use it.  There's a nice chart showing some 
typical computations at 
http://www.dbasupport.com/oracle/ora10g/disk_IO_02.shtml if anybody wants 
to see how this works for other classes of disk.  The other reason I don't 
like focusing too much on IOPS (some people act like it's the only 
measurement that matters) is that it tells you nothing about the 
sequential read rate, and you have to consider both at once to get a clear 
picture--particularly when there are adjustments that impact those two 
oppositely, like read-ahead.
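
To make the arithmetic above easy to redo for other classes of disk, here is a minimal Python sketch of the same ballpark model.  The function names are made up for illustration, and it deliberately ignores the time spent reading data, just like the simple IOPS figure described above.

```python
def rotation_ms(rpm):
    """Time for one full platter rotation, in milliseconds."""
    return 60_000.0 / rpm


def access_time_ms(rpm, avg_seek_ms):
    """Average seek time plus half a rotation of rotational latency."""
    return avg_seek_ms + rotation_ms(rpm) / 2.0


def seeks_per_second(rpm, avg_seek_ms):
    """Ballpark IOPS: how many such accesses fit into one second."""
    return 1000.0 / access_time_ms(rpm, avg_seek_ms)


if __name__ == "__main__":
    # The 15000RPM drive with a 4ms average seek from the text.
    print(rotation_ms(15000))            # full rotation, ms
    print(access_time_ms(15000, 4.0))    # seek + half rotation, ms
    print(seeks_per_second(15000, 4.0))  # a bit under 167
```

Plugging in the RPM and average seek time of a slower drive shows how quickly the ceiling on seeks per second drops.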

As for the internal transfer speed from the heads to the drive's cache 
once it's lined up, that is creeping up toward the 200MB/s range for the 
kind of faster drives the rest of these stats come from.  So the default 
read-ahead of 128kB is going to take about 0.6ms, while a full 1MB might 
take 5ms.  You're 
absolutely right to question how hard that will degrade seek performance; 
these slightly more accurate numbers suggest that might be as bad as going 
from 6.6ms to 11ms per seek, or from 150 IOPS to 91 IOPS.  It also points 
out how outrageously large the really big read-ahead numbers are once 
you're seeking instead of sequentially reading.
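
The effect of read-ahead size on seek-bound throughput can be sketched the same way.  This is only a rough Python model under the assumptions used in the text: 6ms of seek plus rotational latency, a 200MB/s internal transfer rate, and 1MB taken as 1000kB to match the rounding above.  The function names are hypothetical.

```python
def transfer_ms(read_kb, rate_mb_per_s):
    """Time to read read_kb off the platter, in ms (1 MB = 1000 kB)."""
    # kB divided by MB/s comes out directly in milliseconds.
    return read_kb / rate_mb_per_s


def iops_with_read(access_ms, read_kb, rate_mb_per_s):
    """Seeks per second when every seek also reads read_kb of data."""
    return 1000.0 / (access_ms + transfer_ms(read_kb, rate_mb_per_s))


if __name__ == "__main__":
    for read_kb in (128, 1000):
        total_ms = 6.0 + transfer_ms(read_kb, 200)
        rate = iops_with_read(6.0, read_kb, 200)
        print(f"{read_kb}kB read-ahead: {total_ms:.2f}ms per seek, "
              f"{rate:.0f} IOPS")
```

Trying larger read-ahead values in the loop shows how quickly the transfer time starts to dominate the seek itself.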

One thing that's hard to know is how much read-ahead the drive was going 
to do on its own anyway as part of its caching algorithm, no matter what 
you told it.

> I suppose I should learn more about pgbench.

Most people use it as just a simple benchmark that includes a mixed 
read/update/insert workload.  But that's internally done using a little 
command substitution "language" that lets you easily write things like 
"generate a random number between 1 and 1M, read the record from this 
table, and then update this associated record" that scale based on how big 
the data set you've given it is.  You can write your own scripts in that 
form too.  And if you specify several scripts like that at a time, it will 
switch between them at random, and you can analyze the average execution 
time broken down per type if you save the latency logs. Makes it real easy 
to adjust the number of clients and the mix of things you have them do.
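
As a rough illustration of the per-type breakdown, here is a minimal Python sketch that averages latency per script from saved latency logs.  It assumes the whitespace-separated per-transaction log layout I recall from the pgbench documentation, where the third field is the elapsed transaction time in microseconds and the fourth is the script number; check that against the documentation for your version before trusting it.  The function name and the sample lines are made up for illustration and are not real benchmark output.

```python
from collections import defaultdict


def average_latency_ms(lines):
    """Return {script_no: average latency in ms} for pgbench log lines.

    Assumed field order: client_id transaction_no time_us script_no ...
    """
    totals = defaultdict(lambda: [0, 0])  # script_no -> [sum_us, count]
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        latency_us = int(fields[2])
        script_no = int(fields[3])
        totals[script_no][0] += latency_us
        totals[script_no][1] += 1
    return {s: total / count / 1000.0 for s, (total, count) in totals.items()}


if __name__ == "__main__":
    # Invented sample lines in the assumed format.
    sample = [
        "0 1 2000 0 1220000000 100000",
        "1 1 5000 1 1220000000 200000",
        "0 2 6000 0 1220000001 300000",
    ]
    for script_no, avg in sorted(average_latency_ms(sample).items()):
        print(f"script {script_no}: {avg:.1f} ms average")
```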

The main problem: it doesn't scale to large numbers of clients very well. 
But it can easily simulate 50-100 clients banging away at a time, which is usually 
enough to rank filesystem concurrency capabilities, for example.  It's 
certainly way easier to throw together a benchmark using it that is 
similar to an abstract application than it is to try and model multi-user 
database I/O using fio.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD
