Re: Effects of setting linux block device readahead size

From: "Scott Carey" <scott(at)richrelevance(dot)com>
To: "Greg Smith" <gsmith(at)gregsmith(dot)com>
Cc: "Mark Wong" <markwkm(at)gmail(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>, "Gabrielle Roth" <gorthx(at)gmail(dot)com>, "Selena Deckelmann" <selenamarie(at)gmail(dot)com>
Subject: Re: Effects of setting linux block device readahead size
Date: 2008-09-11 19:07:25
Lists: pgsql-performance

Hmm, I would expect this tunable to potentially be rather file system
dependent, and potentially raid controller dependant. The test was using
ext2, perhaps the others automatically prefetch or read ahead? Does it
vary by RAID controller?

Well I went and found out, using ext3 and xfs. I have about 120+ data
points but here are a few interesting ones before I compile the rest and
answer a few other questions of my own.

1: readahead does not affect "pure" random I/O -- there seems to be a
heuristic trigger -- a single process or file probably has to request a
sequence of linear I/O of some size to trigger it. I set it to over 64MB of
read-ahead and random iops remained the same to prove this.
2: File system matters more than you would expect. XFS sequential
transfers when readahead was tuned had TWICE the sequential throughput of
ext3, both for a single reader and 8 concurrent readers on 8 different
3: The RAID controller and its configuration make a pretty significant
difference as well.

12 7200RPM SATA (Seagate) in raid 10 on 3Ware 9650 (only ext3)
12 7200RPM SATA ('nearline SAS' : Seagate ES.2) on PERC 6 in raid 10 (ext3,
I also have some results with PERC raid 10 with 4x 15K SAS, not reporting in
this message though

Testing process:
All tests begin with
#sync; echo 3 > /proc/sys/vm/drop_caches;
followed by
#blockdev --setra XXX /dev/sdb
Even though FIO claims that it issues reads that don't go to cache, the
read-ahead DOES go to the file system cache, and so one must drop them to
get consistent results unless you disable the read-ahead. Even if you are
reading more than 2x the physical RAM, that first half of the test is
distorted. By flushing the cache first my results became consistent within
about +-2%.
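
The preparation steps above can be wrapped in a small helper so that every run starts from the same state. This is only a sketch, not the harness actually used for these results: the names DEV, DRY_RUN, run and prepare_run are made up for illustration, and the readahead values in the loop are just a sample. It defaults to a dry run that prints the commands; running it for real needs root.

```shell
#!/bin/sh
# Sketch of one benchmark pass: flush caches, set readahead, run a read test.
# DEV, DRY_RUN, run() and prepare_run() are illustrative names (assumptions),
# not part of the original test setup.
DEV=${DEV:-/dev/sdb}
DRY_RUN=${DRY_RUN:-1}

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

prepare_run() {
    # $1 = readahead in 512-byte sectors
    run sync
    run sh -c 'echo 3 > /proc/sys/vm/drop_caches'
    run blockdev --setra "$1" "$DEV"
}

# Sample sweep over a few readahead values, followed by a dd read test.
for ra in 256 1024 12288; do
    prepare_run "$ra"
    run dd if=/data/test/seq-read.1.0 of=/dev/null ibs=32K obs=32K
done
```

Set DRY_RUN=0 (as root) to execute the commands instead of printing them.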

-- fio, read 8 files concurrently, sequential read profile, one process per
; this will be total of all individual files per process
; this is number of files total per process

-- fio, read one large file sequentially with one process
; this will be total of all individual files per process
; this is number of files total per process

-- 'dd' in a few ways:
Measure direct to partition / disk read rate at the start of the disk:
'dd if=/dev/sdb of=/dev/null ibs=24M obs=64K'
Measure direct to partition / disk read rate near the end of the disk:
'dd if=/dev/sdb1 of=/dev/null ibs=24M obs=64K skip=160K'
Measure direct read of the large file used by the FIO one sequential file
'dd if=/data/test/seq-read.1.0 of=/dev/null ibs=32K obs=32K'

The dd parameters for block sizes were chosen after much experimentation to
get the best result.

I've got a lot of results, I'm only going to put a few of them here for now
while I investigate a few other things (see the end of this message)
Preliminary summary:

PERC 6, ext3, full partition.
dd beginning of disk : 642MB/sec
dd end of disk: 432MB/sec
dd large file (readahead 49152): 312MB/sec
-- maximum expected sequential capabilities above?

fio: 8 concurrent readers and 1 concurrent reader results
readahead is in 512 byte blocks, sequential transfer rate in MiB/sec as
reported by fio.

readahead | 8 conc read rate | 1 conc read rate
49152 | 311 | 314
16384 | 312 | 312
12288 | 304 | 309
8192 | 292 |
4096 | 264 |
2048 | 211 |
1024 | 162 | 302
512 | 108 |
256 | 81 | 300
8 | 38 |

Conclusion: on this array, raising readahead to 12288 (6MB) makes a huge
difference to sequential read performance under concurrent readers. That is
1MB per RAID slice (6 slices, from 12 disks in RAID 10). It has almost no
impact on a single sequential read; the OS or the RAID controller handles
that case just fine.
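
For reference, the unit conversion behind those numbers can be sketched as below. The function names are made up for illustration; the only facts used are that blockdev readahead is expressed in 512-byte sectors and that this array has 6 striped slices.

```shell
#!/bin/sh
# Convert a blockdev readahead value (512-byte sectors) to KiB.
# Function names are illustrative assumptions.
ra_sectors_to_kib() {
    echo $(( $1 * 512 / 1024 ))
}

# Readahead per striped slice in KiB.
# $1 = readahead in sectors, $2 = number of slices (6 for 12 disks in RAID 10)
ra_kib_per_slice() {
    echo $(( $1 * 512 / 1024 / $2 ))
}

ra_sectors_to_kib 12288      # prints 6144 (6 MiB)
ra_kib_per_slice 12288 6     # prints 1024 (1 MiB per slice)
```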

But, how much of the above effect is ext3? How much is it the RAID card?
At the top end, the sequential rate for both concurrent and single
sequential access is in line with what dd can get going through ext3. But
it is not even close to what you can get going right to the device and
bypassing the file system.

Let's try a different RAID card first. The disks aren't exactly the same,
and there is no guarantee that the file is positioned near the beginning or
end, but I've got another 12 disk RAID 10, using a 3Ware 9650 card.

Results, as above -- don't conclude this card is faster, the files may have
just been closer to the front of the partition.
dd, beginning of disk: 522MB/sec
dd, end of disk array: 412MB/sec
dd, file read via file system (readahead 49152): 391MB/sec

readahead | 8 conc read rate | 1 conc read rate
49152 | 343 | 392
16384 | 349 | 379
12288 | 348 | 387
8192 | 344 |
6144 | | 376
4096 | 340 |
2048 | 319 |
1024 | 284 | 371
512 | 239 | 376
256 | 204 | 377
128 | 169 | 386
8 | 47 | 382

Conclusion: this RAID controller definitely behaves differently. It is much
less sensitive to the readahead. Perhaps it has a larger stripe size? Most
likely this one is set up with a 256K stripe. I do not know the stripe size
of the other one, though the PERC 6 default is 64K, so that is likely.

Ok, so the next question is how file systems play into this.
First, I ran a bunch of tests with xfs, and the results were rather odd.
That is when I realized that the platter speeds at the start and end of the
arrays are significantly different, and that xfs and ext3 make different
decisions on where to put the files on an empty partition (xfs will spread
them evenly, ext3 keeps them closer together but still somewhat random in
actual position).

So, I created a partition that was roughly 10% the size of the whole thing,
at the beginning of the array.

Using the PERC 6 setup, this leads to:
dd, against partition: 660MB/sec max result, 450MB/sec min -- not a reliable
test for some reason
dd, against file on the partition (ext3): 359MB/sec

ext3 (default settings):
readahead | 8 conc read rate | 1 conc read rate
49152 | 363 |
12288 | 359 |
6144 | 319 |
1024 | 176 |
256 | |

Analysis: I only have 8 concurrent read results here, as these are the most
interesting based on the results from the whole disk tests above. I also
did not collect a lot of data points.
What is clear is that the partition at the front does make a difference.
Compared to the whole partition results, we have about 15% more throughput on
the 8 concurrent read test, meaning that ext3 probably put the files in the
whole disk case near the middle of the drive geometry.
The 8 concurrent read test has the same "break point" at about 6MB read
ahead buffer, which is also consistent.

And now, for XFS, a full result set and VERY surprising results. I dare
say, the benchmarks that led me to do these tests are not complete without
XFS tests:

xfs (default settings):
readahead | 8 conc read rate | 1 conc read rate
98304 | 651 | 640
65536 | 636 | 609
49152 | 621 | 595
32768 | 602 | 565
24576 | 595 | 548
16384 | 560 | 518
12288 | 505 | 480
8192 | 437 | 394
6144 | 412 | 415 *
4096 | 357 | 281 *
3072 | 329 | 338
2048 | 259 | 383
1536 | 230 | 445
1280 | 207 | 542
1024 | 182 | 605 *
896 | 167 | 523
768 | 148 | 456
512 | 119 | 354
256 | 88 | 303
64 | 60 | 171
8 | 36 | 55

* these local max and mins for the sequential transfer were tested several
times to validate. They may have something to do with me not tuning the
inode layout for an array using the xfs stripe unit and stripe width

dd, on the file used in the single reader sequential read test:
660MB/sec. One other result for the sequential transfer, using a gigantic
393216 (192MB) readahead:
672 MB/sec.

XFS gets significantly higher sequential (read) transfer rates than ext3.
It had higher write results but I've only done one of those.
Both ext3 and xfs can be tuned a bit more, mainly with noatime and some
parameters so they know about the geometry of the raid array.
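
As a rough illustration of what "knowing about the geometry" means, the sketch below derives the usual stripe hints for mke2fs and mkfs.xfs. The inputs are assumptions, not values confirmed for these arrays: a 64 KiB chunk (the PERC 6 default guessed at above), a 4 KiB filesystem block, and 6 data-bearing slices. The function names and the <device> and <mountpoint> placeholders are hypothetical, and the script only prints the commands rather than running them.

```shell
#!/bin/sh
# Derive filesystem geometry hints from the RAID layout.
# Assumed inputs (not confirmed by the tests above): 64 KiB chunk,
# 4 KiB filesystem block, 6 data-bearing slices (12 disks in RAID 10).
ext3_geometry_opts() {
    # $1 = chunk KiB, $2 = fs block KiB, $3 = data slices
    stride=$(( $1 / $2 ))
    echo "stride=${stride},stripe_width=$(( stride * $3 ))"
}

xfs_geometry_opts() {
    # $1 = chunk KiB, $2 = data slices
    echo "su=${1}k,sw=${2}"
}

# Print (do not execute) the resulting commands; <device> and <mountpoint>
# are placeholders to fill in by hand.
echo "mke2fs -j -E $(ext3_geometry_opts 64 4 6) <device>"
echo "mkfs.xfs -d $(xfs_geometry_opts 64 6) <device>"
echo "mount -o noatime <device> <mountpoint>"
```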

Other misc results:
I used the deadline scheduler, it didn't impact the results here.
I ran some tests to "feel out" the sequential transfer rate sensitivity to
readahead for a 4x 15K RPM SAS raid setup -- it is less sensitive:
ext3, 8 concurrent reads -- readahead = 256, 195MB/sec; readahead = 3072,
200MB/sec; readahead = 32768, 210MB/sec; readahead =64, 120MB/sec
On the 3ware setup, with ext3, postgres was installed and a select count(1)
from table reported between 300 and 320 MB/sec against tables larger than
5GB, and disk utilization was about 88%. dd can get 390 with the settings
used (readahead 12288).
Setting the readahead back to the default, postgres gets about 220MB/sec at
100% disk util on similar tables. I will be testing out xfs on this same
data eventually, and expect it to provide significant gains there.

Remaining questions:
Readahead does NOT activate for pure random requests, which is a good
thing. The question is, when does it activate? I'll have to write some
custom fio tests to find out. I suspect that when the OS detects either: X
number of sequential requests on the same file (or from the same process),
it activates, OR after sequential access of at least Y bytes. I'll report
results once I know, to construct some worst case scenarios of using a large
I will also measure its effect when mixed random access and streaming reads

On Wed, Sep 10, 2008 at 7:49 AM, Greg Smith <gsmith(at)gregsmith(dot)com> wrote:

> On Tue, 9 Sep 2008, Mark Wong wrote:
>> I've started to display the effects of changing the Linux block device
>> readahead buffer to the sequential read performance using fio.
> Ah ha, told you that was your missing tunable. I'd really like to see the
> whole table of one disk numbers re-run when you get a chance. The reversed
> ratio there on ext2 (59MB read/92MB write) was what tipped me off that
> something wasn't quite right initially, and until that's fixed it's hard to
> analyze the rest.
> Based on your initial data, I'd say that the two useful read-ahead settings
> for this system are 1024KB (conservative but a big improvement) and 8192KB
> (point of diminishing returns). The one-disk table you've got (labeled with
> what the default read-ahead is) and new tables at those two values would
> really flesh out what each disk is capable of.
> --
> * Greg Smith gsmith(at)gregsmith(dot)com Baltimore, MD
> --
> Sent via pgsql-performance mailing list (pgsql-performance(at)postgresql(dot)org)
> To make changes to your subscription:
