Re: First set of OSDL Shared Mem scalability results, some wierdness ...

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Kevin Brown <kevin(at)sysexperts(dot)com>
Cc:	pgsql-performance(at)postgresql(dot)org
Subject:	Re: First set of OSDL Shared Mem scalability results, some wierdness ...
Date:	2004-10-09 23:05:37
Message-ID:	4859.1097363137@sss.pgh.pa.us
Lists:	pgsql-hackers pgsql-performance

Kevin Brown <kevin(at)sysexperts(dot)com> writes:
> Tom Lane wrote:
>> mmap() is Right Out because it does not afford us sufficient control
>> over when changes to the in-memory data will propagate to disk.

> ... that's especially true if we simply cannot
> have the page written to disk in a partially-modified state (something
> I can easily see being an issue for the WAL -- would the same hold
> true of the index/data files?).

You're almost there. Remember the fundamental WAL rule: log entries
must hit disk before the data changes they describe. That means that we
need not only a way of forcing changes to disk (fsync) but a way of
being sure that changes have *not* gone to disk yet. In the existing
implementation we get that by just not issuing write() for a given page
until we know that the relevant WAL log entries are fsync'd down to
disk. (BTW, this is what the LSN field on every page is for: it tells
the buffer manager the latest WAL offset that has to be flushed before
it can safely write the page.)
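
As a rough illustration of the rule above, here is a minimal sketch of a buffer manager that checks a page's LSN before writing it out. The names (Wal, Page, BufferManager, write_page) are hypothetical and the "disk" is just a dict; this is a toy model of the idea, not PostgreSQL's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class Wal:
    """Toy write-ahead log: records are appended in memory, then flushed."""
    records: list = field(default_factory=list)
    flushed_upto: int = 0  # LSN up to which the log is assumed durable
    fsync_count: int = 0   # how many (simulated) fsync calls were issued

    def append(self, record) -> int:
        self.records.append(record)
        return len(self.records)  # LSN = position just past this record

    def flush(self, upto: int) -> None:
        # Only pay for an fsync if something at or below `upto` is unflushed.
        if upto > self.flushed_upto:
            self.fsync_count += 1  # stands in for a real fsync()
            self.flushed_upto = upto


@dataclass
class Page:
    data: dict = field(default_factory=dict)
    lsn: int = 0         # latest WAL position describing a change to this page
    dirty: bool = False


class BufferManager:
    def __init__(self, wal: Wal):
        self.wal = wal
        self.disk = {}  # page_id -> data; stands in for the data file

    def update(self, page: Page, key, value) -> None:
        # Log first (in memory only), then change the private buffer.
        page.lsn = self.wal.append((key, value))
        page.data[key] = value
        page.dirty = True

    def write_page(self, page_id, page: Page) -> None:
        # WAL rule: the log must be durable up to page.lsn before the
        # page itself is handed to the kernel.
        self.wal.flush(page.lsn)
        self.disk[page_id] = dict(page.data)  # stands in for write()
        page.dirty = False


if __name__ == "__main__":
    wal = Wal()
    bufmgr = BufferManager(wal)
    page = Page()
    for i in range(3):
        bufmgr.update(page, i, i * 10)
    print("fsyncs before write_page:", wal.fsync_count)
    bufmgr.write_page(7, page)
    print("fsyncs after write_page:", wal.fsync_count)
```

In this sketch the three updates cost no flush at all until write_page is called, because nothing outside the process can see the modified buffer before then.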

mmap provides msync, which is comparable to fsync, but AFAICS it
provides no way to prevent an in-memory change from reaching disk too
soon. This would mean that WAL entries would have to be written *and
flushed* before we could make the data change at all, which would
convert multiple updates of a single page into a series of
write-and-wait-for-WAL-fsync steps. Not good. fsync'ing WAL once per
transaction is bad enough; once per atomic action is intolerable.
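
The cost difference can be sketched by simply counting WAL flushes per transaction under the two schemes. The function names are hypothetical, and the model assumes that no page is evicted in mid-transaction and that the commit record is flushed separately from the update records.

```python
def wal_flushes_deferred_write(updates_per_txn: int) -> int:
    """Private buffers plus write(): WAL is flushed once, at commit."""
    flushes = 0
    pending_records = 0
    for _ in range(updates_per_txn):
        pending_records += 1  # WAL record appended in memory only
        # The page is modified in a private buffer; the kernel cannot
        # write it out, so no flush is needed yet.
    pending_records += 1      # commit record
    if pending_records:
        flushes += 1          # one fsync covers every pending record
    return flushes


def wal_flushes_mmap(updates_per_txn: int) -> int:
    """mmap'd data files: WAL must be durable before each page change."""
    flushes = 0
    for _ in range(updates_per_txn):
        # The kernel may write the mapped page back at any moment, so the
        # WAL record has to reach disk before the page is touched.
        flushes += 1
    flushes += 1              # the commit record still needs its own flush
    return flushes


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, wal_flushes_deferred_write(n), wal_flushes_mmap(n))
```

Under these assumptions the deferred-write count stays constant per transaction, while the mmap count grows linearly with the number of updates.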

There is another reason for doing things this way. Consider a backend
that goes haywire and scribbles all over shared memory before crashing.
When the postmaster sees the abnormal child termination, it forcibly
kills the other active backends and discards shared memory altogether.
This gives us fairly good odds that the crash did not affect any data on
disk. It's not perfect of course, since another backend might have been
in the process of issuing a write() when the disaster happens, but it's
pretty good; and I think that that isolation has a lot to do with PG's
good reputation for not corrupting data in crashes. If we had a large
fraction of the address space mmap'd then this sort of crash would be
just about guaranteed to propagate corruption into the on-disk files.
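
The difference can be seen in a small experiment, sketched here with Python's standard mmap module. The helper names are hypothetical, and the result for the mapped case assumes a system with a unified page cache (as on Linux), where changes to a shared mapping become visible in the file without any explicit write() or msync.

```python
import mmap
import os
import tempfile

PAGE = 4096


def scribble_private_buffer(path: str) -> None:
    """Read the file into process-private memory, trash it, discard it."""
    with open(path, "rb") as f:
        buf = bytearray(f.read())
    buf[:] = b"X" * len(buf)  # stray write stays in process memory
    del buf                   # memory is thrown away; write() never issued


def scribble_mapped_file(path: str) -> None:
    """Map the file shared, trash the mapping, unmap without flushing."""
    with open(path, "r+b") as f:
        m = mmap.mmap(f.fileno(), 0)  # default access writes through to the file
        m[:] = b"X" * len(m)          # stray write lands in the file's pages
        m.close()                     # no flush() (msync) is ever called


def read_back(path: str) -> bytes:
    with open(path, "rb") as f:
        return f.read()


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.write(fd, b"A" * PAGE)
    os.close(fd)
    scribble_private_buffer(path)
    print("after private scribble:", read_back(path)[:8])
    scribble_mapped_file(path)
    print("after mapped scribble: ", read_back(path)[:8])
    os.remove(path)
```

The private-buffer scribble never reaches the file because nothing ever calls write(); the mapped scribble needs no such call to become part of the file's contents.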

regards, tom lane
