Re: Buffer Management

From: Curt Sampson <cjs(at)cynic(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "J(dot) R(dot) Nield" <jrnield(at)usol(dot)com>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, PostgreSQL Hacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Buffer Management
Date: 2002-06-26 04:13:42
Message-ID: Pine.NEB.4.43.0206261149170.670-100000@angelic.cynic.net
Lists: pgsql-hackers

On Tue, 25 Jun 2002, Tom Lane wrote:

> Curt Sampson <cjs(at)cynic(dot)net> writes:
>
> > I don't understand why there would be any loss of visibility of changes.
> > If two backends mmap the same block of a file, and it's shared, that's
> > the same block of physical memory that they're accessing.
>
> Is it? You have a mighty narrow conception of the range of
> implementations that's possible for mmap.

It's certainly possible to implement something that you call mmap
that does not behave this way. But if you are using the POSIX-defined
MAP_SHARED flag, the behaviour above is what you see. It might be
implemented slightly differently internally, but that's no concern of
postgres. And I find it pretty unlikely that it would be implemented
otherwise without good reason.
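
As a rough illustration of the MAP_SHARED behaviour described above,
here is a minimal sketch in Python (not postgres code; the function
name shared_mapping_demo is made up for this example, and it assumes a
Unix system with fork). A child process maps the same block on its own
and writes to it, and the parent reads the change through its existing
mapping without any msync or read call in between.

```python
import mmap
import os
import tempfile


def shared_mapping_demo():
    """Return what the parent sees after a child writes via its own mapping."""
    with tempfile.TemporaryFile() as f:
        f.truncate(mmap.PAGESIZE)  # one block of backing store
        fd = f.fileno()
        parent_view = mmap.mmap(fd, mmap.PAGESIZE, flags=mmap.MAP_SHARED)

        pid = os.fork()
        if pid == 0:
            # Child: map the same block independently, then dirty it.
            child_view = mmap.mmap(fd, mmap.PAGESIZE, flags=mmap.MAP_SHARED)
            child_view[0:5] = b"hello"
            os._exit(0)

        os.waitpid(pid, 0)
        # Parent: no flush was requested, yet the write is visible here.
        data = bytes(parent_view[0:5])
        parent_view.close()
        return data
```

This only shows visibility between processes; it says nothing about
when the kernel writes the dirty page back to disk.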

Note that your proposal of using mmap to replace sysv shared memory
relies on the behaviour I've described too. As well, if you're replacing
sysv shared memory with an mmap'd file, you may end up doing excessive
disk I/O on systems without the MAP_NOSYNC option. (Without this option,
the update thread/daemon may ensure that every buffer is flushed to the
backing store on disk every 30 seconds or so. You might be able to get
around this by using a small file-backed area for things that need to
persist after a crash, and a larger anonymous area for things that don't
need to persist after a crash.)
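
The split suggested in the parenthesis above might look roughly like
the following sketch. It is Python rather than C, the name make_areas
and its parameters are hypothetical, and it does not use MAP_NOSYNC at
all, since that flag is not portably available. It assumes a Unix
system, where passing -1 as the file descriptor asks for anonymous
memory that is shared with forked children.

```python
import mmap
import os


def make_areas(path, persistent_size, scratch_size):
    """Return (persistent, scratch) mappings; both names are illustrative."""
    # Small file-backed area: contents can survive a crash, but the OS
    # may write dirty pages back to the file whenever it likes.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.ftruncate(fd, persistent_size)
        persistent = mmap.mmap(fd, persistent_size, flags=mmap.MAP_SHARED)
    finally:
        os.close(fd)  # the mapping stays valid after the descriptor is closed

    # Larger anonymous area: no backing file, so there is nothing for a
    # periodic sync to flush to a data file.
    scratch = mmap.mmap(-1, scratch_size)
    return persistent, scratch
```

Only the small area then pays for periodic write-back; everything in
the anonymous area is lost on a crash, which is the intent.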

> But the main problem is that mmap doesn't let us control when changes to
> the memory buffer will get reflected back to disk --- AFAICT, the OS is
> free to do the write-back at any instant after you dirty the page, and
> that completely breaks the WAL algorithm. (WAL = write AHEAD log;
> the log entry describing a change must hit disk before the data page
> change itself does.)

Hm. Well, we could try not to write the data to the page until
after we receive notification that our WAL data is committed to
stable storage. However, the new data has to be available to all of
the backends at the exact time that the commit happens. Perhaps a
shared list of pending writes?
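
One way to picture such a list is the sketch below. Everything in it
is an assumption made for illustration: the class PendingWrites, its
method names, and the use of a plain dict in place of the mmap'd data
file. It is single-process Python, whereas a real version would live
in shared memory and need locking. Changes sit in the list, readers
consult the list before the page, and a change is copied to the page
only once the WAL is known to be flushed up to its log position.

```python
class PendingWrites:
    """Hypothetical sketch: hold page changes until their WAL record is durable."""

    def __init__(self):
        self.pages = {}       # stands in for the mmap'd file: page_no -> bytes
        self.pending = []     # (lsn, page_no, data), oldest first
        self.next_lsn = 1
        self.flushed_lsn = 0  # highest log position known to be on disk

    def write(self, page_no, data):
        """Queue a change instead of dirtying the page; return its log position."""
        lsn = self.next_lsn
        self.next_lsn += 1
        self.pending.append((lsn, page_no, data))
        return lsn

    def read(self, page_no):
        """Readers see the newest pending change, else the page itself."""
        for _lsn, pending_page, data in reversed(self.pending):
            if pending_page == page_no:
                return data
        return self.pages.get(page_no)

    def wal_flushed(self, lsn):
        """Called once the WAL is durable up to lsn; apply what is now safe."""
        self.flushed_lsn = max(self.flushed_lsn, lsn)
        still_pending = []
        for entry in self.pending:
            if entry[0] <= self.flushed_lsn:
                self.pages[entry[1]] = entry[2]  # safe: log record hit disk first
            else:
                still_pending.append(entry)
        self.pending = still_pending
```

The page is never dirtied ahead of its log record, so the OS writing
it back at an arbitrary moment cannot violate the write-ahead rule.
The cost is an extra lookup on every read.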

Another option would be to just let it write, but on startup, scan
all of the data blocks in the database for tuples that have a
transaction ID later than the last one we updated to, and remove
them. That could be pretty darn expensive on a large database, though.
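
A toy version of that startup scan, under stated assumptions: pages
are modelled as a dict of lists of (xid, payload) tuples, the name
scrub_pages is made up, and transaction IDs are compared as plain
integers with no wraparound handling. It is meant only to show why the
cost grows with the size of the database: every tuple on every page
has to be looked at.

```python
def scrub_pages(pages, last_durable_xid):
    """Drop tuples newer than last_durable_xid; return how many were removed."""
    removed = 0
    for page_no, tuples in list(pages.items()):
        # Keep only tuples written by transactions known to be durable.
        kept = [t for t in tuples if t[0] <= last_durable_xid]
        removed += len(tuples) - len(kept)
        pages[page_no] = kept
    return removed
```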

cjs
--
Curt Sampson <cjs(at)cynic(dot)net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
