Re: Sequential Scan Read-Ahead

From: Curt Sampson <cjs(at)cynic(dot)net>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Sequential Scan Read-Ahead
Date: 2002-04-25 03:19:14
Message-ID: Pine.NEB.4.43.0204251118040.445-100000@angelic.cynic.net
Lists: pgsql-hackers

On Wed, 24 Apr 2002, Bruce Momjian wrote:

> > 1. Not all systems do readahead.
>
> If they don't, that isn't our problem. We expect it to be there, and if
> it isn't, the vendor/kernel is at fault.

It is your problem when another database kicks Postgres' ass
performance-wise.

And at that point, *you're* at fault. You're the one who's knowingly
decided to do things inefficiently.

Sorry if this sounds harsh, but this, "Oh, someone else is to blame"
attitude gets me steamed. It's one thing to say, "We don't support
this." That's fine; there are often good reasons for that. It's a
completely different thing to say, "It's an unrelated entity's fault we
don't support this."

At any rate, relying on the kernel to guess how to optimise for
the workload will never work as well as having the software that
knows the workload do the optimization.

The lack of support thing is no joke. Sure, lots of systems nowadays
support unified buffer cache and read-ahead. But how many, besides
Solaris, support free-behind, which is also very important to avoid
blowing out your buffer cache when doing sequential reads? And who
at all supports read-ahead for reverse scans? (Or does Postgres
not do those, anyway? I can see the support is there.)
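
For what it's worth, here is a rough sketch of what "the application
says what it wants" could look like on a system that offers an
advisory interface such as posix_fadvise(). This is illustration
only, in Python rather than anything resembling Postgres code: the
function name scan_with_hints and the 64K chunk size are my own
assumptions, the call is not available on every platform, and the
kernel is free to ignore the hints entirely.

```python
import os

def scan_with_hints(path, chunk_size=64 * 1024):
    """Sequentially read `path`, hinting read-ahead and free-behind.

    Returns the number of bytes read. The hints are advisory only.
    """
    have_fadvise = hasattr(os, "posix_fadvise")  # missing on some platforms
    fd = os.open(path, os.O_RDONLY)
    try:
        if have_fadvise:
            # Length 0 means "to end of file": announce a sequential scan.
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        offset = 0
        while True:
            buf = os.read(fd, chunk_size)
            if not buf:  # empty result means end of file
                break
            # ... a real scan would process buf here ...
            if have_fadvise:
                # Free-behind: say we are done with the range just consumed.
                os.posix_fadvise(fd, offset, len(buf), os.POSIX_FADV_DONTNEED)
            offset += len(buf)
        return offset
    finally:
        os.close(fd)
```

Note that this still leaves the actual read-ahead and eviction
decisions to the kernel; it only passes along what the application
already knows about its access pattern.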

And even when the facilities are there, you create problems by
using them. Look at the OS buffer cache, for example. Not only do
we lose efficiency by using two layers of caching, but (as people
have pointed out recently on the lists), the optimizer can't even
know how much or what is being cached, and thus can't make decisions
based on that.

> Yes, seek() in file will turn off read-ahead. Grabbing bigger chunks
> would help here, but if you have two people already reading from the
> same file, grabbing bigger chunks of the file may not be optimal.

Grabbing bigger chunks is always optimal, AFAICT, if they're not
*too* big and you use the data. A single 64K read takes very little
longer than a single 8K read.
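
To put numbers on the call count, here is a minimal sketch (Python,
not Postgres code; count_reads and demo are names made up for this
example, and the 256K scratch file is an arbitrary choice). It reads
the same file once in 8K chunks and once in 64K chunks and counts
the read() calls that return data. It only counts calls; it does not
measure elapsed time, so it says nothing by itself about how long
each read takes.

```python
import os
import tempfile

def count_reads(path, chunk_size):
    """Read the whole file with os.read(); return (bytes_read, calls_with_data)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        total = 0
        calls = 0
        while True:
            buf = os.read(fd, chunk_size)  # one read(2) call per iteration
            if not buf:                    # empty result means end of file
                break
            total += len(buf)
            calls += 1
        return total, calls
    finally:
        os.close(fd)

def demo(size=256 * 1024):
    """Write a scratch file of `size` bytes; compare 8K reads with 64K reads."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"\0" * size)
        path = f.name
    try:
        return count_reads(path, 8 * 1024), count_reads(path, 64 * 1024)
    finally:
        os.unlink(path)
```

Both passes return the same number of bytes; the 64K pass simply
crosses the kernel/userland boundary far fewer times to get them.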

> > 3. Even when the read-ahead does occur, you're still doing more
> > syscalls, and thus more expensive kernel/userland transitions, than
> > you have to.
>
> I would guess the performance impact is minimal.

If it were minimal, people wouldn't work so hard to build multi-level
thread systems, where multiple userland threads are scheduled on
top of kernel threads.

However, it does depend on how much CPU your particular application
is using. You may have it to spare.

> http://candle.pha.pa.us/mhonarc/todo.detail/performance/msg00009.html

Well, this message has some points in it that I feel are just incorrect.

1. It is *not* true that you have no idea where data is when
using a storage array or other similar system. While you
certainly ought not worry about things such as head positions
and so on, it's been a given for a long, long time that two
blocks that have close index numbers are going to be close
together in physical storage.

2. Raw devices are quite standard across Unix systems (except
in the unfortunate case of Linux, which I think has been
remedied, hasn't it?). They're very portable, and have just as
well--if not better--defined write semantics as a filesystem.

3. My observations of OS performance tuning over the past six
or eight years contradict the statement, "There's a considerable
cost in complexity and code in using "raw" storage too, and
it's not a one off cost: as the technologies change, the "fast"
way to do things will change and the code will have to be
updated to match." While optimizations have been removed over
the years, the basic optimizations (order reads by block number,
do larger reads rather than smaller, cache the data) have
remained unchanged for a long, long time.

4. "Better to leave this to the OS vendor where possible, and
take advantage of the tuning they do." Well, sorry guys, but
have a look at the tuning they do. It hasn't changed in years,
except to remove now-unnecessary complexity related to really,
really old and slow disk devices, and to add a few things that
guess workload but still do a worse job than if the workload
generator just did its own optimisations in the first place.
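
As an illustration of the "basic optimizations" in point 3 (order
reads by block number, do larger reads rather than smaller), here is
a small sketch. It is Python, not Postgres code; coalesce_blocks and
read_blocks are hypothetical names, and the 8K default block size is
just the figure used earlier in this message. It assumes a platform
that provides pread().

```python
import os

def coalesce_blocks(blocks):
    """Sort block numbers and merge adjacent ones into (first_block, count) runs."""
    runs = []
    for b in sorted(set(blocks)):
        if runs and b == runs[-1][0] + runs[-1][1]:
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)  # extends the current run
        else:
            runs.append((b, 1))                         # starts a new run
    return runs

def read_blocks(fd, blocks, block_size=8192):
    """Fetch the requested blocks with one pread() per run; return {block: bytes}."""
    out = {}
    for first, count in coalesce_blocks(blocks):
        data = os.pread(fd, count * block_size, first * block_size)
        for i in range(count):
            out[first + i] = data[i * block_size:(i + 1) * block_size]
    return out

print(coalesce_blocks([7, 3, 4, 8, 5, 20]))  # [(3, 3), (7, 2), (20, 1)]
```

Six scattered block requests turn into three reads issued in
ascending block order, which is the sort of thing only the code that
knows the full request list can do up front.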

> http://candle.pha.pa.us/mhonarc/todo.detail/optimizer/msg00011.html

Well, this one, with statements like "Postgres does have control
over its buffer cache," I don't know what to say. You can interpret
the statement however you like, but in the end Postgres has very
little control at all over how data is moved between memory and disk.

BTW, please don't take me as saying that all control over physical
IO should be done by Postgres. I just think that Postgres could do
a better job of managing data transfer between disk and memory than
the OS can. The rest of the things (using raw partitions, read-ahead,
free-behind, etc.) just drop out of that one idea.

cjs
--
Curt Sampson <cjs(at)cynic(dot)net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
