Great info Greg,
Some follow-up questions and information in-line:
On Wed, Sep 10, 2008 at 12:44 PM, Greg Smith <gsmith(at)gregsmith(dot)com> wrote:
> On Wed, 10 Sep 2008, Scott Carey wrote:
>> How does that readahead tunable affect random reads or mixed random /
>> sequential situations?
> It still helps as long as you don't make the parameter giant. The read
> cache in a typical hard drive nowadays is 8-32MB. If you're seeking a lot,
> you still might as well read the next 1MB or so after the block requested
> once you've gone to the trouble of moving the disk somewhere. Seek-bound
> workloads will only waste a relatively small amount of the disk's read cache
> that way--the slow seek rate itself keeps that from polluting the buffer
> cache too fast with those reads--while sequential ones benefit enormously.
> If you look at Mark's tests, you can see approximately where the readahead
> is filling the disk's internal buffers, because what happens then is the
> sequential read performance improvement levels off. That looks near 8MB for
> the array he's tested, but I'd like to see a single disk to better feel that
> out. Basically, once you know that, you back off from there as much as you
> can without killing sequential performance completely and that point should
> still support a mixed workload.
> Disks are fairly well understood physical components, and if you think in
> those terms you can build a gross model easily enough:
> Average seek time: 4ms
> Seeks/second: 250
> Data read/seek: 1MB (read-ahead number goes here)
> Total read bandwidth: 250MB/s
> Since that's around what a typical interface can support, that's why I
> suggest a 1MB read-ahead shouldn't hurt even seek-only workloads, and it's
> pretty close to optimal for sequential as well here (big improvement from
> the default Linux RA of 256 blocks=128K). If you know your work is biased
> heavily toward sequential scans, you might pick the 8MB read-ahead instead.
> That value (--setra=16384 -> 8MB) has actually been the standard "start
> here" setting 3ware suggests on Linux for a while now:
Ok, so this is a drive level parameter that affects the data going into the
disk cache? Or does it also get pulled over the SATA/SAS link into the OS
page cache? I've been searching around with Google for the answer and can't
seem to find it.
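For reference, the knob I've been experimenting with is the one below; I'm assuming it's the standard Linux block-layer setting and not something the controller translates per disk (device names are just examples):

blockdev --getra /dev/sda                 # current read-ahead in 512-byte sectors (256 = 128KB default)
blockdev --setra 2048 /dev/sda            # 2048 sectors = 1MB
cat /sys/block/sda/queue/read_ahead_kb    # the same setting, exposed in KB via sysfs
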
Additionally, I would like to know how this works with hardware RAID -- Does
it set this value per disk? Does it set it at the array level (so that 1MB
with an 8 disk stripe is actually 128K per disk)? Is it RAID driver
dependent? If it is purely the OS, then it is above the RAID level and affects
the whole array -- and is hence almost useless. If it is for the whole
array, it would have a horrendous negative impact on random I/O per second if
the total readahead became longer than a stripe width -- if it covers a full
stripe, then each I/O, even one smaller than the stripe size, would cause
an I/O on every drive, dropping the I/O per second to that of a single drive.
If it is a drive level setting, then it won't affect i/o per sec by making
i/o's span multiple drives in a RAID, which is good.
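To put rough numbers on the array-level worry (chunk size and per-drive IOPS here are just illustrative guesses):

# 8-drive stripe, 64KB chunk => 512KB full stripe width, ~250 random IOPS per drive.
# If a read-ahead longer than the stripe width is applied above the RAID layer,
# every random read touches every drive:
awk 'BEGIN { drives = 8; iops = 250;
  printf "drives seeking independently:     ~%d IOPS\n", drives * iops;
  printf "every drive on every random read: ~%d IOPS\n", iops }'
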
Additionally, the OS should have a good heuristic-based read-ahead process
that should make the drive/device-level read-ahead much less important. I
don't know how long it's going to take for Linux to do this right.
Let's expand a bit on your model above for a single disk:
A single disk with 4ms seeks and a max disk throughput of 125MB/sec; the
interface can transfer 300MB/sec.
That's 250 seeks/sec. Some chunk of data in each seek is free; after that it is
limited by the 125MB/sec transfer rate, so 512KB can be read in 4ms. A 1MB
read-ahead would then mean a 4ms seek plus an 8ms read, or 12ms per random
I/O -- roughly 83 seeks/sec.
However, some chunk of that 1MB is "free" with the seek. I'm not sure how
much per drive, but it is likely on the order of 8K - 64K.
I suppose I'll have to experiment in order to find out. But I can't see how
a 1MB read-ahead, which should take 2x as long as the seek itself to read off
the platters, could fail to have a significant impact on random I/O per second
on single drives. For SATA drives the transfer-rate-to-seek-time ratio is
smaller, and their caches are bigger, so a large read-ahead will impact
random I/O per second even more there.
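Putting that arithmetic in one place (same assumed numbers as above: 4ms seek, 125MB/sec sustained transfer, and ignoring whatever chunk comes free with the seek):

for ra_kb in 128 512 1024 2048 8192; do
  awk -v ra="$ra_kb" 'BEGIN { seek_ms = 4.0; xfer = 125.0;
    read_ms = (ra / 1024.0) / xfer * 1000.0;
    printf "%5dKB read-ahead: %5.1f ms/IO, ~%3.0f seeks/sec\n", ra, seek_ms + read_ms, 1000.0 / (seek_ms + read_ms) }'
done

Which is why I suspect even a 1MB read-ahead will show up in the random I/O numbers, and 8MB certainly would.
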
>> I would be very interested in a mixed fio profile with a "background writer"
>> doing moderate, paced random and sequential writes combined with
>> sequential reads and random reads.
> Trying to make disk benchmarks really complicated is a path that leads to a
> lot of wasted time. I once made this gigantic design plan for something that
> worked like the PostgreSQL buffer management system to work as a disk
> benchmarking tool. I threw it away after confirming I could do better with
> carefully scripted pgbench tests.
> If you want to benchmark something that looks like a database workload,
> benchmark a database workload. That will always be better than guessing
> what such a workload acts like in a synthetic fashion. The "seeks/second"
> number bonnie++ spits out is good enough for most purposes at figuring out
> if you've detuned seeks badly.
> "pgbench -S" run against a giant database gives results that look a lot
> like seeks/second, and if you mix multiple custom -f tests together it will
> round-robin between them at random...
I suppose I should learn more about pgbench. Most of this depends on how
much time it takes to do one versus the other. In my case, setting up the
DB will take significantly longer than writing 1 or 2 more fio profiles. I
categorize mixed-load tests as basic tests -- you don't want to uncover
configuration issues during the application test that a simple mix of
read/write and sequential/random could have exposed beforehand.
This is similar to increasing the concurrency. Some file systems deal with
concurrency much better than others.
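If I do end up going the pgbench route, I assume the recipe is roughly this (the scale factor is just a placeholder for "much larger than RAM", this is 8.3-era pgbench so a transaction count rather than a run time, and the custom script names are hypothetical):

pgbench -i -s 10000 bench                      # build a database far larger than RAM
pgbench -S -c 8 -t 20000 bench                 # select-only: effectively a seeks/second test
pgbench -c 8 -t 20000 -f rand_reads.sql -f seq_scan.sql bench   # mix custom scripts at random
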
> It's really helpful to measure these various disk subsystem parameters
> individually. Knowing the sequential read/write, seeks/second, and commit
> rate for a disk setup is mainly valuable at making sure you're getting the
> full performance expected from what you've got. Like in this example, where
> something was obviously off on the single disk results because reads were
> significantly slower than writes. That's not supposed to happen, so you
> know something basic is wrong before you even get into RAID and such. Beyond
> confirming whether or not you're getting approximately what you should be
> out of the basic hardware, disk benchmarks are much less useful than
> application ones.
Absolutely -- it's critical to run the synthetic tests, and the random
read/write and sequential read/write numbers are the essential ones. These
should be tuned and understood before going on to more complicated things.
However, once you actually go and set up a database test, there are tons of
questions -- what type of database? what type of query load? what type of
mix? how big? In my case, the answer is, our database, our queries, and
big. That takes a lot of setup effort, and redoing it for each new file
system will take a long time in my case -- pg_restore takes a day+.
Therefore, I'd like to know ahead of time what file system + configuration
combinations are a waste of time because they don't perform under
concurrency with a mixed workload. That's my admittedly greedy need for the
extra test results.
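For what it's worth, the sort of extra fio profile I have in mind looks roughly like the following (paths, sizes, and the write pacing are all placeholders):

cat > mixed.fio <<'EOF'
# all jobs in the file run concurrently against the same directory
[global]
directory=/data/fio
size=8g
runtime=300
time_based

[seq-read]
rw=read
bs=1m

[rand-read]
rw=randread
bs=8k

[paced-background-writes]
rw=randwrite
bs=8k
rate=2m
EOF
fio mixed.fio
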
> With all that, I think I just gave away what the next conference paper I've
> been working on is about.
> * Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD
Looking forward to it!