Re: mmap (was First set of OSDL Shared Mem scalability results, some wierdness ...

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Sean Chittenden <sean(at)chittenden(dot)org>
Cc: Kevin Brown <kevin(at)sysexperts(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: mmap (was First set of OSDL Shared Mem scalability results, some wierdness ...
Date: 2004-10-22 04:12:44
Message-ID: 2823.1098418364@sss.pgh.pa.us
Lists: pgsql-performance

Sean Chittenden <sean(at)chittenden(dot)org> writes:
> When a backend wishes to write a page, the following steps are taken:
> ...
> 2) Backend mmap(2)'s a second copy of the page(s) being written to,
> this time with the MAP_PRIVATE flag set.
> ...
> 5) Once the WAL logging is complete and it has hit the disk, the
> backend msync(2)'s its private copy of the pages to disk (ASYNC or
> SYNC, it doesn't really matter too much to me).

My man page for mmap says that changes in a MAP_PRIVATE region are
private; they do not affect the file at all, msync or no. So I don't
think the above actually works.
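
A quick way to check the man page's claim is the sketch below. It uses Python's mmap module as a thin wrapper over mmap(2); ACCESS_COPY is that module's name for a private, copy-on-write mapping, and the helper name and scratch file are made up for illustration.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE

def private_write_then_flush():
    """Modify a private (copy-on-write) mapping, flush it, and return
    (bytes seen through the mapping, bytes actually in the file)."""
    fd, path = tempfile.mkstemp()          # hypothetical scratch "data file"
    try:
        os.write(fd, b"A" * PAGE)
        # ACCESS_COPY: writes land in process-private pages, not in the file.
        with mmap.mmap(fd, PAGE, access=mmap.ACCESS_COPY) as m:
            m[:4] = b"ZZZZ"
            seen_in_mapping = bytes(m[:4])
            m.flush()                      # the msync-style call; nothing to write back
        os.lseek(fd, 0, os.SEEK_SET)
        seen_in_file = os.read(fd, 4)
        return seen_in_mapping, seen_in_file
    finally:
        os.close(fd)
        os.unlink(path)
```

The mapping shows the new bytes while the file keeps the old ones, which is the behaviour that makes step 5 of the proposal a no-op.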

In any case, this scheme still forces you to flush WAL records to disk
before making the changed page visible to other backends, so I don't
see how it improves the situation. In the existing scheme we only have
to fsync WAL at (1) transaction commit, (2) when we are forced to write
a page out from shared buffers because we are short of buffers, or (3)
checkpoint. Anything that implies an fsync per atomic action is going
to be a loser. It does not matter how great your kernel API is if you
only get to perform one atomic action per disk rotation :-(
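
To put a rough number on "one atomic action per disk rotation", here is a back-of-envelope sketch. It assumes every action waits for its own WAL flush, each flush costs one full platter rotation, and there is no write cache or batching of flushes; the spindle speeds are just common examples, not measurements.

```python
def max_synced_actions_per_second(rpm: float) -> float:
    """Upper bound on actions/sec when each action waits for one flush
    and each flush costs one full disk rotation (assumed model)."""
    rotations_per_second = rpm / 60.0
    return rotations_per_second

for rpm in (7200, 10000, 15000):
    ceiling = max_synced_actions_per_second(rpm)
    print(f"{rpm} RPM: at most about {ceiling:.0f} synced actions per second")
```

Under those assumptions a 7200 RPM disk tops out at 120 such actions per second for the whole system, regardless of how fast the CPU or the kernel API is.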

The important point here is that you can't postpone making changes at
the page level visible to other backends; there's no MVCC at this level.
Consider for example two backends wanting to insert a new row. If they
both MAP_PRIVATE the same page, they'll probably choose the same tuple
slot on the page to insert into (certainly there is nothing to stop that
from happening). Now you have conflicting definitions for the same
CTID, not to mention probably conflicting uses of the page's physical
free space; disaster ensues. So "atomic action" really means "lock
page, make changes, add WAL record to in-memory WAL buffers, unlock
page" with the understanding that as soon as you unlock the page the
changes you've made in it are visible to all other backends. You
*can't* afford to put a WAL fsync in this sequence.
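
The slot collision and the "atomic action" sequence can be sketched with a toy model. Everything here (Page, wal_buffer, insert_shared, insert_private) is a hypothetical illustration, not PostgreSQL's actual buffer manager; threads stand in for backends and a Python list stands in for the in-memory WAL buffers.

```python
import threading

class Page:
    """Toy heap page: a fixed array of tuple slots (None means free)."""
    def __init__(self, nslots: int = 4):
        self.slots = [None] * nslots
        self.lock = threading.Lock()

wal_buffer = []  # in-memory WAL records; flushed later, never per action

def insert_shared(page: Page, row: str) -> int:
    """Lock page, make the change, queue the WAL record in memory, unlock.
    The change is visible to everyone as soon as the lock is released."""
    with page.lock:
        slot = page.slots.index(None)      # first free slot
        page.slots[slot] = row
        wal_buffer.append(("insert", slot, row))
    return slot

def insert_private(page: Page, row: str) -> int:
    """Work on a private snapshot of the page, as a private mapping would.
    Nothing tells other callers that the chosen slot is now taken."""
    snapshot = list(page.slots)
    slot = snapshot.index(None)
    snapshot[slot] = row
    return slot
```

Calling insert_private twice on the same fresh page hands out slot 0 both times, which is the conflicting-CTID case; calling insert_shared twice hands out slots 0 and 1 and leaves two records in wal_buffer without any flush in the critical section.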

You could possibly buy back most of the lossage in this scenario if
there were some efficient way for a backend to hold the low-level lock
on a page just until some other backend wanted to modify the page;
whereupon the previous owner would have to do what's needed to make his
changes visible before releasing the lock. Given the right access
patterns you don't have to fsync very often (though given the wrong
access patterns you're still in deep trouble). But we don't have any
such mechanism and I think the communication costs of one would be
forbidding.
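
As a rough illustration of why the access pattern matters, the sketch below counts forced flushes for a single page under that hypothetical hand-off rule. No such mechanism exists; the function name and the one-flush-per-ownership-change cost model are assumptions, and the communication cost of the hand-off itself is ignored.

```python
from typing import Optional, Sequence

def flushes_with_lazy_handoff(writers: Sequence[str]) -> int:
    """Count forced flushes for one page when the lock stays with the last
    writer until a different backend wants to modify the page."""
    flushes = 0
    owner: Optional[str] = None
    for backend in writers:
        if owner is not None and backend != owner:
            # Previous owner must make its changes visible before letting go.
            flushes += 1
        owner = backend
    return flushes

friendly = ["A", "A", "A", "A", "B", "B", "B", "B"]
hostile = ["A", "B", "A", "B", "A", "B", "A", "B"]
print(flushes_with_lazy_handoff(friendly), flushes_with_lazy_handoff(hostile))
```

Eight modifications cost one flush when the writers take turns in long runs and seven when they alternate, so a hot page touched by many backends gains almost nothing.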

> [ much snipped ]
> 4) Not having shared pages get lost when the backend dies (mmap(2) uses
> refcounts and cleans itself up, no need for ipcs/ipcrm/ipcclean).

Actually, that is not a bug, that's a feature. One of the things that
scares me about mmap is that a crashing backend is able to scribble all
over live disk buffers before it finally SEGV's (think about memcpy gone
wrong and similar cases). In our existing scheme there's a pretty good
chance that we will be able to commit hara-kiri before any of the
trashed data gets written out. In an mmap scheme, it's time to dig out
your backup tapes, because there simply is no distinction between
transient and permanent data --- the kernel has no way to know that you
didn't mean it.
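
The difference can be seen in a small experiment, sketched here with Python's mmap module (ACCESS_WRITE gives a shared, write-through mapping). The stray_write helper and the file contents are made up, the non-mmap branch only loosely models buffers that are separate from the data file, and the result assumes a system where mapped pages and read() share one page cache, as on Linux.

```python
import mmap
import os
import tempfile

def stray_write(shared: bool) -> bytes:
    """Simulate a wild write into a page buffer, then return what the
    data file holds afterwards."""
    fd, path = tempfile.mkstemp()          # hypothetical scratch "data file"
    try:
        os.write(fd, b"good data")
        if shared:
            # The buffer *is* the file: a shared, writable mapping.
            with mmap.mmap(fd, 0, access=mmap.ACCESS_WRITE) as buf:
                buf[:4] = b"JUNK"          # the bug; no write() or flush() follows
        else:
            # The buffer is separate memory that reaches the file only
            # through an explicit write, which never happens here.
            os.lseek(fd, 0, os.SEEK_SET)
            buf = bytearray(os.read(fd, 9))
            buf[:4] = b"JUNK"              # same bug, then the process "dies"
        os.lseek(fd, 0, os.SEEK_SET)
        return os.read(fd, 9)
    finally:
        os.close(fd)
        os.unlink(path)
```

With the shared mapping the junk is already in the file image without any write call having been made; with the separate buffer the file is untouched because the step that would have copied the damage out never ran.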

In short, I remain entirely unconvinced that mmap is of any interest to us.

regards, tom lane
