Re: Sequential Scan Read-Ahead

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Curt Sampson <cjs(at)cynic(dot)net>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Sequential Scan Read-Ahead
Date: 2002-04-25 13:54:32
Message-ID: 25056.1019742872@sss.pgh.pa.us
Lists: pgsql-hackers

Curt Sampson <cjs(at)cynic(dot)net> writes:
> 1. Theoretical proof: two components of the delay in retrieving a
> block from disk are the disk arm movement and the wait for the
> right block to rotate under the head.

> When retrieving, say, eight adjacent blocks, these will be spread
> across no more than two cylinders (with luck, only one).

Weren't you contending earlier that with modern disk mechs you really
have no idea where the data is? You're asserting as an article of
faith that the OS has been able to place the file's data blocks
optimally --- or at least well enough to avoid unnecessary seeks.
But just a few days ago I was getting told that random_page_cost
was BS because there could be no such placement.

I'm getting a tad tired of sweeping generalizations offered without
proof, especially when they conflict.

> 3. Proof by testing. I wrote a little ruby program to seek to a
> random point in the first 2 GB of my raw disk partition and read
> 1-8 8K blocks of data. (This was done as one I/O request.) (Using
> the raw disk partition I avoid any filesystem buffering.)

And also ensure that you aren't testing the point at issue.
The point at issue is that *in the presence of kernel read-ahead*
it's quite unclear that there's any benefit to a larger request size.
Ideally the kernel will have the next block ready for you when you
ask, no matter what the request is.
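
For anyone who wants to test that point rather than the raw-device case, here is a minimal sketch of the kind of measurement that would apply: read an ordinary file through the filesystem, in order, and vary only the request size. Everything in it is an assumption for illustration (the names `timed_sequential_read` and `make_scratch_file`, the scratch file, the 8K block size taken from the quoted test). It is not the Ruby program described above. A freshly written scratch file will sit in the kernel's cache, so a meaningful run needs a file much larger than RAM, or a cold cache.

```python
import os
import tempfile
import time

BLOCK_SIZE = 8192  # 8K blocks, as in the quoted test


def timed_sequential_read(path, blocks_per_request):
    """Read the whole file in order; return (bytes_read, elapsed_seconds)."""
    request_size = BLOCK_SIZE * blocks_per_request
    total = 0
    start = time.perf_counter()
    # buffering=0 turns off Python's own buffer, so each read() is one
    # read(2) call. The kernel's cache and read-ahead stay in play,
    # which is the case under discussion.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(request_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start


def make_scratch_file(n_blocks):
    """Create a hypothetical scratch file of n_blocks * BLOCK_SIZE bytes."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for _ in range(n_blocks):
            f.write(os.urandom(BLOCK_SIZE))
    return path


if __name__ == "__main__":
    path = make_scratch_file(1024)  # 8 MB: far too small for a real test
    try:
        for n in (1, 2, 4, 8):
            nbytes, secs = timed_sequential_read(path, n)
            print(f"{n} block(s)/request: {nbytes} bytes in {secs:.4f}s")
    finally:
        os.remove(path)
```

If kernel read-ahead is doing its job, the per-byte times for the different request sizes should come out close to one another; the sketch only produces the numbers and does not predict them.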

There's been some talk of using the AIO interface (where available)
to "encourage" the kernel to do read-ahead. I don't foresee us
writing our own substitute filesystem to make this happen, however.
Oracle may have the manpower for that sort of boondoggle, but we
don't...
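
As an illustration of what "encouraging" the kernel could look like from user space, here is a small sketch. Note that it does not use AIO: it swaps in a different mechanism, the POSIX `posix_fadvise` hint, as exposed by Python's `os` module on platforms that provide it. The function names `hint_sequential` and `scan` are hypothetical, and whether the hint changes anything depends entirely on the kernel and filesystem.

```python
import os

BLOCK_SIZE = 8192  # assumed 8K block size


def hint_sequential(fd):
    """Tell the kernel fd will be read in order; True if the hint was issued."""
    if not hasattr(os, "posix_fadvise"):
        return False  # platform has no posix_fadvise
    try:
        # offset=0, len=0 means "from the start to the end of the file"
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
    except OSError:
        return False  # the kernel or filesystem rejected the hint
    return True


def scan(path):
    """Read the file one block at a time; return (bytes_read, hinted)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        hinted = hint_sequential(fd)
        total = 0
        while True:
            chunk = os.read(fd, BLOCK_SIZE)
            if not chunk:
                break
            total += len(chunk)
        return total, hinted
    finally:
        os.close(fd)
```

The hint is advisory only. The reads that follow are the same one-block requests as before, and the kernel is free to ignore the advice.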

regards, tom lane
