Buffer Management

From: Curt Sampson <cjs(at)cynic(dot)net>
To: "J(dot) R(dot) Nield" <jrnield(at)usol(dot)com>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Buffer Management
Date: 2002-06-25 05:05:45
Message-ID: Pine.NEB.4.43.0206251232130.17448-100000@angelic.cynic.net
Lists: pgsql-hackers

I'm splitting off this buffer management stuff into a separate thread.

On 24 Jun 2002, J. R. Nield wrote:

> I'll back off on that. I don't know if we want to use the OS buffer
> manager, but shouldn't we try to have our buffer manager group writes
> together by files, and pro-actively get them out to disk?

The only way the postgres buffer manager can "get [data] out to disk"
is to do an fsync(). For data files (as opposed to log files), this can
only slow down overall system throughput, because it disrupts the OS's
own write management.
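
As a rough illustration of that distinction, here is a minimal Python
sketch (not PostgreSQL code; the helper name write_then_sync and the file
name are made up for the example). A plain write() only hands the data to
the kernel, while fsync() is the call that waits for the device.

```python
import os
import tempfile


def write_then_sync(path, payload, force=False):
    """Write payload to path and return the number of bytes written.

    Hypothetical helper for illustration only.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        # write() returns once the kernel has copied the data into its
        # own buffers; the OS decides when the disk write happens.
        written = os.write(fd, payload)
        if force:
            # fsync() blocks until the kernel reports that this file's
            # data has been written to the storage device.
            os.fsync(fd)
        return written
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "datafile")
        print(write_then_sync(target, b"x" * 8192, force=True))
```

With force=False the call returns as soon as the kernel has the data, which
is the case where the OS is left free to schedule the write itself.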

> Right now, it
> looks like all our write requests are delayed as long as possible and
> the order in which they are written is pretty-much random, as is the
> backend that writes the block, so there is no locality of reference even
> when the blocks are adjacent on disk, and the write calls are spread-out
> over all the backends.

It doesn't matter. The OS will introduce locality of reference with its
write algorithms. Take a look at

http://www.cs.wisc.edu/~solomon/cs537/disksched.html

for an example. Most OSes use the elevator or one-way elevator
algorithm. So it doesn't matter whether it's one back-end or many
writing, and it doesn't matter in what order they do the write.
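
To see why arrival order stops mattering, here is a simplified Python
sketch of the two policies named above. It is a toy model that assumes
all requests are already queued when the sweep starts; real schedulers
work on a queue that keeps changing. The function name elevator_order
and the block numbers are made up for the example.

```python
def elevator_order(pending, head, one_way=False):
    """Return the order in which queued block numbers would be serviced.

    pending: block numbers queued by any number of processes, in any order.
    head:    current head position, as a block number.
    one_way: False models the elevator (sweep up, then back down);
             True models the one-way elevator (sweep up, jump back to the
             lowest pending block, sweep up again).
    """
    ahead = sorted(b for b in pending if b >= head)
    behind = sorted(b for b in pending if b < head)
    if one_way:
        return ahead + behind
    return ahead + behind[::-1]


if __name__ == "__main__":
    # Two hypothetical backends whose writes arrive interleaved.
    arrival_a = [90, 13, 12, 91, 55, 54]
    arrival_b = [12, 54, 90, 13, 55, 91]
    print(elevator_order(arrival_a, head=50))
    print(elevator_order(arrival_b, head=50))
    print(elevator_order(arrival_a, head=50, one_way=True))
```

Both arrival orders produce the same service order, and adjacent blocks
(12 and 13, 54 and 55, 90 and 91) end up next to each other no matter
which backend queued them.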

> Would it not be the case that things like read-ahead, grouping writes,
> and caching written data are probably best done by PostgreSQL, because
> only our buffer manager can understand when they will be useful or when
> they will thrash the cache?

Operating systems these days are not too bad at guessing what
you're doing. Pretty much every OS I've seen will do read-ahead when
it detects you're doing sequential reads, at least in the forward
direction. And Solaris is even smart enough to mark the pages you've
read as "not needed" so that they quickly get flushed from the cache,
rather than blowing out your entire cache if you go through a large
file.
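
The kind of detection described here can be pictured with a toy model in
Python. This is not any real kernel's algorithm; the class name
ReadAheadDetector and the window size of 4 blocks are assumptions made
for the example. It only shows the idea: once a read follows directly
after the previous one, start fetching the next few blocks early.

```python
class ReadAheadDetector:
    """Toy model of forward sequential-read detection."""

    def __init__(self, window=4):
        self.window = window        # blocks to prefetch (assumed value)
        self.last_block = None      # block number of the previous read
        self.prefetched = set()     # blocks fetched ahead of demand

    def read(self, block):
        """Record a read; return True if the block was already prefetched."""
        hit = block in self.prefetched
        if self.last_block is not None and block == self.last_block + 1:
            # Forward sequential access detected: prefetch what comes next.
            self.prefetched.update(range(block + 1, block + 1 + self.window))
        self.last_block = block
        return hit


if __name__ == "__main__":
    forward = ReadAheadDetector()
    print([forward.read(b) for b in range(6)])
    scattered = ReadAheadDetector()
    print([scattered.read(b) for b in [7, 3, 9, 1]])
```

In this model a forward scan starts hitting prefetched blocks from the
third read onward, while scattered or backward reads never trigger any
prefetching.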

> Would O_DSYNC|O_RSYNC turn off the cache?

No. I suppose there's nothing to stop it doing so, in some
implementations, but the interface is not designed for direct I/O.
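
For reference, this is roughly what opening a file with those flags looks
like from Python. It is a sketch under assumptions: the helper name
write_and_read_back is made up, and the getattr() fallbacks are there
because not every platform defines both flags. The flags change when
write() and read() are allowed to return; nothing in the interface asks
the OS to bypass its cache.

```python
import os


def write_and_read_back(path, payload):
    """Write payload with synchronous-I/O flags set, then read it back."""
    flags = os.O_RDWR | os.O_CREAT | os.O_TRUNC
    # Fall back to 0 (no extra flag) where a platform lacks the constant.
    flags |= getattr(os, "O_DSYNC", 0) | getattr(os, "O_RSYNC", 0)
    fd = os.open(path, flags, 0o600)
    try:
        # With O_DSYNC, write() returns only after the data is reported
        # as being on stable storage.
        os.write(fd, payload)
        os.lseek(fd, 0, os.SEEK_SET)
        # Whether this read is served from the OS cache is up to the OS;
        # the flags do not request direct I/O.
        return os.read(fd, len(payload))
    finally:
        os.close(fd)
```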

> Since you know a lot about NetBSD internals, I'd be interested in
> hearing about what postgresql looks like to the NetBSD buffer manager.

Well, looks like pretty much any program, or group of programs,
doing a lot of I/O. :-)

> Am I right that strings of successive writes get randomized?

No; as I pointed out, they get de-randomized as much as possible. In
fact, the more processes you have throwing out requests, the better the
throughput will be.

> What do our cache-hit percentages look like? I'm going to do some
> experimenting with this.

Well, that depends on how much memory you have and what your working
set is. :-)

cjs
--
Curt Sampson <cjs(at)cynic(dot)net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
