Re: Sequential Scan Read-Ahead

From: Curt Sampson <cjs(at)cynic(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Sequential Scan Read-Ahead
Date: 2002-04-25 07:28:51
Message-ID: Pine.NEB.4.43.0204251534590.3111-100000@angelic.cynic.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, 24 Apr 2002, Tom Lane wrote:

> Curt Sampson <cjs(at)cynic(dot)net> writes:
> > Grabbing bigger chunks is always optimal, AFICT, if they're not
> > *too* big and you use the data. A single 64K read takes very little
> > longer than a single 8K read.
>
> Proof?

Well, there are various sorts of "proof" for this assertion. What
sort do you want?

Here's a few samples; if you're looking for something different to
satisfy you, let's discuss it.

1. Theoretical proof: two components of the delay in retrieving a
block from disk are the disk arm movement and the wait for the
right block to rotate under the head.

When retrieving, say, eight adjacent blocks, these will be spread
across no more than two cylinders (with luck, only one). The worst
case access time for a single block is the disk arm movement plus
the full rotational wait; this is the same as the worst case for
eight blocks if they're all on one cylinder. If they're not on one
cylinder, they're still on adjacent cylinders, requiring a very
short seek.

2. Proof by others using it: SQL server uses 64K reads when doing
table scans, as they say that their research indicates that the
major limitation is usually the number of I/O requests, not the
I/O capacity of the disk. BSD's explicitly separates the optimum
allocation size for storage (1K fragments) and optimum read size
(8K blocks) because they found performance to be much better when
a larger size block was read. Most file system vendors, too, do
read-ahead for this very reason.

3. Proof by testing. I wrote a little ruby program to seek to a
random point in the first 2 GB of my raw disk partition and read
1-8 8K blocks of data. (This was done as one I/O request.) (Using
the raw disk partition I avoid any filesystem buffering.) Here are
typical results:

125 reads of 16x8K blocks: 1.9 sec, 66.04 req/sec. 15.1 ms/req, 0.946 ms/block
250 reads of 8x8K blocks: 1.9 sec, 132.3 req/sec. 7.56 ms/req, 0.945 ms/block
500 reads of 4x8K blocks: 2.5 sec, 199 req/sec. 5.03 ms/req, 1.26 ms/block
1000 reads of 2x8K blocks: 3.8 sec, 261.6 req/sec. 3.82 ms/req, 1.91 ms/block
2000 reads of 1x8K blocks: 6.4 sec, 310.4 req/sec. 3.22 ms/req, 3.22 ms/block

The ratios of data retrieval speed per read for groups of adjacent
8K blocks, assuming a single 8K block reads in 1 time unit, are:

1 block 1.00
2 blocks 1.18
4 blocks 1.56
8 blocks 2.34
16 blocks 4.68

At less than 20% more expensive, certainly two-block read requests
could be considered to cost "very little more" than one-block read
requests. Even four-block read requests are only half-again as
expensive. And if you know you're really going to be using the
data, read in 8 block chunks and your cost per block (in terms of
time) drops to less than a third of the cost of single-block reads.

Let me put paid to comments about multiple simultaneous readers
making this invalid. Here's a typical result I get with four
instances of the program running simultaneously:

125 reads of 16x8K blocks: 4.4 sec, 28.21 req/sec. 35.4 ms/req, 2.22 ms/block
250 reads of 8x8K blocks: 3.9 sec, 64.88 req/sec. 15.4 ms/req, 1.93 ms/block
500 reads of 4x8K blocks: 5.8 sec, 86.52 req/sec. 11.6 ms/req, 2.89 ms/block
1000 reads of 2x8K blocks: 10 sec, 100.2 req/sec. 9.98 ms/req, 4.99 ms/block
2000 reads of 1x8K blocks: 18 sec, 110 req/sec. 9.09 ms/req, 9.09 ms/block

Here's the ratio table again, with another column comparing the
aggregate number of requests per second for one process and four
processes:

1 block 1.00 310 : 440
2 blocks 1.10 262 : 401
4 blocks 1.28 199 : 346
8 blocks 1.69 132 : 260
16 blocks 3.89 66 : 113

Note that, here the relative increase in performance for increasing
sizes of reads is even *better* until we get past 64K chunks. The
overall throughput is better, of course, because with more requests
per second coming in, the disk seek ordering code has more to work
with and the average seek time spent seeking vs. reading will be
reduced.

You know, this is not rocket science; I'm sure there must be papers
all over the place about this. If anybody still disagrees that it's
a good thing to read chunks up to 64K or so when the blocks are
adjacent and you know you'll need the data, I'd like to see some
tangible evidence to support that.

cjs
--
Curt Sampson <cjs(at)cynic(dot)net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Hannu Krosing 2002-04-25 07:39:12 Re: PostgreSQL index usage discussion.
Previous Message Luis Alberto Amigo Navarro 2002-04-25 06:56:44 Re: PostgreSQL index usage discussion.