mmap (was First set of OSDL Shared Mem scalability results, some wierdness ...

From: Aaron Werman <aaron(dot)werman(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org, Kevin Brown <kevin(at)sysexperts(dot)com>
Subject: mmap (was First set of OSDL Shared Mem scalability results, some wierdness ...
Date: 2004-10-15 00:25:36
Message-ID: 157f64840410141725162e43b5@mail.gmail.com
Lists: pgsql-performance

pg is, to my mind, unique in not trying to avoid OS buffering. Other
dbmses spend substantial effort to create a virtual OS (task
management, I/O drivers, etc.) both in code and in support. Choosing mmap
seems a very limiting option - it adds OS dependency and limits
kernel developer options (2G limits, global mlock serialization,
porting problems, inability to schedule or parallelize I/O, and still
having to coordinate writers and readers).

More to the point, I think it is very hard to coordinate
multithreaded I/O effectively, and mmap seems to be used mostly to manage
relatively simple scenarios. If the I/O options are:
- the OS (which has enormous investment behind it and is stable, but is
general purpose and carries overhead)
- pg itself (direct I/O would be costly and potentially destabilizing, but
with big possible performance rewards)
- mmap (mostly used to reduce buffer copies in less concurrent apps
such as image processing; it carries major architectural risk, including
an order of magnitude more semaphores, though it can eliminate some
extra block copies)
then mmap doesn't look that promising.

/Aaron

----- Original Message -----
From: "Kevin Brown" <kevin(at)sysexperts(dot)com>
To: <pgsql-performance(at)postgresql(dot)org>
Sent: Thursday, October 14, 2004 4:25 PM
Subject: Re: [PERFORM] First set of OSDL Shared Mem scalability
results, some wierdness ...

> Tom Lane wrote:
> > Kevin Brown <kevin(at)sysexperts(dot)com> writes:
> > > Tom Lane wrote:
> > >> mmap() is Right Out because it does not afford us sufficient control
> > >> over when changes to the in-memory data will propagate to disk.
> >
> > > ... that's especially true if we simply cannot
> > > have the page written to disk in a partially-modified state (something
> > > I can easily see being an issue for the WAL -- would the same hold
> > > true of the index/data files?).
> >
> > You're almost there. Remember the fundamental WAL rule: log entries
> > must hit disk before the data changes they describe. That means that we
> > need not only a way of forcing changes to disk (fsync) but a way of
> > being sure that changes have *not* gone to disk yet. In the existing
> > implementation we get that by just not issuing write() for a given page
> > until we know that the relevant WAL log entries are fsync'd down to
> > disk. (BTW, this is what the LSN field on every page is for: it tells
> > the buffer manager the latest WAL offset that has to be flushed before
> > it can safely write the page.)
> >
> > mmap provides msync which is comparable to fsync, but AFAICS it
> > provides no way to prevent an in-memory change from reaching disk too
> > soon. This would mean that WAL entries would have to be written *and
> > flushed* before we could make the data change at all, which would
> > convert multiple updates of a single page into a series of write-and-
> > wait-for-WAL-fsync steps. Not good. fsync'ing WAL once per transaction
> > is bad enough, once per atomic action is intolerable.
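A minimal sketch of the ordering rule Tom describes - a dirty page may
only be written once the WAL has been fsync'd at least up to that page's
LSN. This is not the actual PostgreSQL bufmgr code; the types and the
flushed-WAL tracking are simplified stand-ins:

/* "WAL before data": hold back a dirty page until the log catches up. */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t Lsn;           /* byte offset into the WAL stream */

static Lsn wal_flushed_upto;    /* how far the WAL has been fsync'd */

/* Hypothetical helper: write() + fsync() the log through 'upto'. */
static void wal_flush(Lsn upto)
{
    if (upto > wal_flushed_upto)
    {
        /* ... flush the log segment here ... */
        wal_flushed_upto = upto;
    }
}

typedef struct
{
    Lsn  lsn;                   /* LSN of the last WAL record touching this page */
    char data[8192];
} Page;

static void flush_page(Page *page)
{
    /* The page's LSN names the newest WAL record that modified it;
     * that record must be durable before the page itself may go out. */
    if (page->lsn > wal_flushed_upto)
        wal_flush(page->lsn);

    /* ... only now is it safe to write() the page to the data file ... */
    printf("page with LSN %llu written (WAL flushed to %llu)\n",
           (unsigned long long) page->lsn,
           (unsigned long long) wal_flushed_upto);
}

int main(void)
{
    Page p = { .lsn = 1234 };
    flush_page(&p);
    return 0;
}

The control mmap lacks is exactly the "if" above: with write() the backend
decides when the page goes out, whereas a shared writeable mapping can be
flushed by the kernel at any time.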
>
> Hmm...something just occurred to me about this.
>
> Would a hybrid approach be possible? That is, use mmap() to handle
> reads, and use write() to handle writes?
>
> Any code that wishes to write to a page would have to recognize that
> it's doing so and fetch a copy from the storage manager (or
> something), which would look to see if the page already exists as a
> writeable buffer. If it doesn't, it creates it by allocating the
> memory and then copying the page from the mmap()ed area to the new
> buffer, and returning it. If it does, it just returns a pointer to
> the buffer. There would obviously have to be some bookkeeping
> involved: the storage manager would have to know how to map a mmap()ed
> page back to a writeable buffer and vice-versa, so that once it
> decides to write the buffer it can determine which page in the
> original file the buffer corresponds to (so it can do the appropriate
> seek()).
>
> In a write-heavy database, you'll end up with a lot of memory copy
> operations, but with the scheme we currently use you get that anyway
> (it just happens in kernel code instead of user code), so I don't see
> that as much of a loss, if any. Where you win is in a read-heavy
> database: you end up being able to read directly from the pages in the
> kernel's page cache and thus save a memory copy from kernel space to
> user space, not to mention the context switch that happens due to
> issuing the read().
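A rough sketch of that hybrid scheme, assuming an illustrative file name,
page size and bookkeeping (not a worked-out storage-manager design): read
pages straight out of a read-only mmap(), copy a page into a private
buffer before modifying it, and push it back with pwrite() once the WAL
rules allow.

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define PAGE_SIZE 8192

int main(void)
{
    int fd = open("relation.dat", O_RDWR);
    if (fd < 0)
        return 1;

    struct stat st;
    fstat(fd, &st);

    /* Read side: map the file read-only; readers use this directly, so
     * there is no kernel-to-user copy and no read() syscall per access. */
    const char *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return 1;

    /* Write side: to modify page 0, take a private writeable copy ... */
    off_t pageno = 0;
    char buf[PAGE_SIZE];
    memcpy(buf, map + pageno * PAGE_SIZE, PAGE_SIZE);
    buf[100] = 'x';                       /* modify the private copy only */

    /* ... and only after the relevant WAL has been fsync'd, write it back
     * to the offset the page came from. */
    pwrite(fd, buf, PAGE_SIZE, pageno * PAGE_SIZE);

    munmap((void *) map, st.st_size);
    close(fd);
    return 0;
}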
>
>
> Obviously you'd want to mmap() the file read-only in order to prevent
> the issues you mention regarding an errant backend, and then reopen
> the file read-write for the purpose of writing to it. In fact, you
> could decouple the two: mmap() the file, then close the file -- the
> mmap()ed region will remain mapped. Then, as long as the file remains
> mapped, you need to open the file again only when you want to write to
> it.
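That decoupling works because the mapping holds its own reference to the
file. A minimal sketch (the file name is illustrative):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("relation.dat", O_RDONLY);
    if (fd < 0)
        return 1;

    struct stat st;
    fstat(fd, &st);

    const char *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping stays valid after close() */
    if (map == MAP_FAILED)
        return 1;

    /* ... read pages through 'map' for as long as we like ... */

    /* Only when a write is actually needed do we open the file again: */
    int wfd = open("relation.dat", O_WRONLY);
    if (wfd >= 0)
    {
        /* pwrite() a modified page copy here, then close again. */
        close(wfd);
    }

    munmap((void *) map, st.st_size);
    return 0;
}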
>
>
> --
> Kevin Brown kevin(at)sysexperts(dot)com
>
> ---------------------------(end of broadcast)---------------------------
> TIP 8: explain analyze is your friend
>
--

Regards,
/Aaron
