Re: double writes using "double-write buffer" approach [WIP]

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dan Scales <scales(at)vmware(dot)com>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: double writes using "double-write buffer" approach [WIP]
Date: 2012-02-03 21:48:54
Message-ID: CA+TgmobCJEVmnWGam7EmWAeZ5zGWYFN4QmC11Ha6JzdeTdX3aQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Feb 3, 2012 at 3:14 PM, Dan Scales <scales(at)vmware(dot)com> wrote:
> Thanks for the feedback!  I think you make a good point about the small size of dirty data in the OS cache.  I think what you can say about this double-write patch is that it will not work well for configurations that have a small Postgres cache and a large OS cache, since every write from the Postgres cache requires double-writes and an fsync.

The general guidance for setting shared_buffers these days is 25% of
RAM up to a maximum of 8GB, so the configuration that you're
describing as not optimal for this patch is the one normally used when
running PostgreSQL. I've run across several cases where larger values
of shared_buffers are a huge win, because the entire working set can
then be accommodated in shared_buffers. But it's certainly not the
case that all working sets fit.
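As a back-of-the-envelope check, the sizing rule above can be written out. The helper name is mine, not anything in PostgreSQL itself; it just encodes "25% of RAM, capped at 8GB":

```python
def recommended_shared_buffers(ram_bytes):
    """Rule of thumb from the thread: 25% of RAM, capped at 8GB.

    Illustrative only -- the actual best value is workload-dependent.
    """
    GB = 1024 ** 3
    return min(ram_bytes // 4, 8 * GB)

# A 16GB machine lands at 4GB; a 64GB machine hits the 8GB cap,
# leaving most of RAM to the OS cache -- exactly the configuration
# the double-write patch handles poorly.
print(recommended_shared_buffers(16 * 1024**3) // 1024**3)  # 4
print(recommended_shared_buffers(64 * 1024**3) // 1024**3)  # 8
```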

And in this case, I think that's beside the point anyway. I had
shared_buffers set to 8GB on a machine with much more memory than
that, but the database created by pgbench -i -s 10 is about 156 MB, so
the problem isn't that there is too little PostgreSQL cache available.
The entire database fits in shared_buffers, with most of it left
over. However, because of the BufferAccessStrategy stuff, pages start
to get forced out to the OS pretty quickly. Of course, we could
disable the BufferAccessStrategy stuff when double_writes is in use,
but bear in mind that the reason we have it in the first place is to
prevent cache thrashing effects. It would be imprudent of us to throw
that out the window without replacing it with something else that
would provide similar protection. And even if we did, that would just
delay the day of reckoning. You'd be able to blast through and dirty
the entirety of shared_buffers at top speed, but then as soon as you
started replacing pages performance would slow to an utter crawl, just
as it did here, only you'd need a bigger scale factor to trigger the
problem.
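A toy model makes the ring effect concrete: even with shared_buffers mostly empty, a bulk operation cycles through a small fixed set of buffers, so dirty pages start going to the OS as soon as the ring wraps. The ring size and round-robin replacement here are illustrative only, not PostgreSQL's actual BufferAccessStrategy code:

```python
def bulk_write(num_pages, ring_size):
    """Simulate a bulk load confined to a small ring of buffers.

    Returns how many dirty pages are forced out to the OS, even
    though the rest of the (large) buffer pool stays empty.
    """
    ring = [None] * ring_size
    evictions = 0
    for page in range(num_pages):
        slot = page % ring_size        # reuse the same small set of buffers
        if ring[slot] is not None:
            evictions += 1             # dirty page pushed out to the OS cache
        ring[slot] = page
    return evictions

# Dirtying 10,000 pages through a 256-buffer ring forces out all but
# the last ring's worth, no matter how big shared_buffers is.
print(bulk_write(10_000, 256))  # 9744
```

With double writes enabled, each of those forced-out pages would carry the extra write-and-fsync cost, which is why the slowdown shows up long before shared_buffers is full.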

The more general point here is that there are MANY aspects of
PostgreSQL's design that assume that shared_buffers accounts for a
relatively small percentage of system memory. Here's another one: we
assume that backends that need temporary memory for sorts and hashes
(i.e. work_mem) can just allocate it from the OS. If we were to start
recommending setting shared_buffers to large percentages of the
available memory, we'd probably have to rethink that. Most likely,
we'd need some kind of in-core mechanism for allocating temporary
memory from the shared memory segment. And here's yet another one: we
assume that it is better to recycle old WAL files and overwrite the
contents rather than create new, empty ones, because we assume that
the pages from the old files may still be present in the OS cache. We
also rely on the fact that an evicted CLOG page can be pulled back in
quickly without (in most cases) a disk access. We also rely on
shared_buffers not being too large to avoid walloping the I/O
controller too hard at checkpoint time - which is forcing some people
to set shared_buffers much smaller than would otherwise be ideal. In
other words, even if setting shared_buffers to most of the available
system memory would fix the problem I mentioned, it would create a
whole bunch of new ones, many of them non-trivial. It may be a good
idea to think about what we'd need to do to work efficiently in that
sort of configuration, but there is going to be a very large amount of
thinking, testing, and engineering that has to be done to make it a
reality.
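The WAL-recycling assumption, for instance, can be sketched: rather than creating a fresh, empty segment (whose blocks would miss the OS cache), an old segment is renamed into place and overwritten. The file names and this stripped-down rename logic are mine; the real code in xlog.c does considerably more:

```python
import os
import tempfile

def recycle_segment(old_path, new_path, page_size=8192, pages=4):
    """Recycle an old WAL segment by renaming it rather than creating
    a new, empty file: its pages may still be warm in the OS cache,
    and no new disk blocks need to be allocated."""
    os.rename(old_path, new_path)       # cheap rename instead of create
    with open(new_path, "r+b") as f:    # overwrite contents in place
        f.write(b"\x00" * page_size * pages)

# Hypothetical segment names, for illustration only.
d = tempfile.mkdtemp()
old = os.path.join(d, "000000010000000000000001")
new = os.path.join(d, "000000010000000000000005")
with open(old, "wb") as f:
    f.write(b"\xff" * 8192 * 4)
recycle_segment(old, new)
print(os.path.exists(old), os.path.exists(new))  # False True
```

The point of the sketch is only that several such mechanisms quietly assume a large, warm OS cache sitting behind a comparatively small shared_buffers.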

There's another issue here, too. The idea that we're going to write
data to the double-write buffer only when we decide to evict the pages
strikes me as a bad one. We ought to proactively start dumping pages
to the double-write area as soon as they're dirtied, and fsync them
after every N pages, so that by the time we need to evict some page
that requires a double-write, it's already durably on disk in the
double-write buffer, and we can do the real write without having to
wait. It's likely that, to make this perform acceptably for bulk
loads, you'll need the writes to the double-write buffer and the
fsyncs of that buffer to be done by separate processes, so that one
backend (the background writer, perhaps) can continue spooling
additional pages to the double-write files while some other process (a
new auxiliary process?) fsyncs the ones that are already full. Along
with that, the page replacement algorithm probably needs to be
adjusted to avoid evicting pages that need an as-yet-unfinished
double-write like the plague, even to the extent of allowing the
BufferAccessStrategy rings to grow if the double-writes can't be
finished before the ring wraps around.
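The spool-then-fsync idea above can be sketched as a toy: pages go to the double-write area as soon as they're dirtied, the area is fsynced every N pages, and a page only becomes eligible for its real in-place write once its double-write copy is durable. The batch size, class name, and the "durable" bookkeeping are my own simplification of the proposal, and the two roles that the text assigns to separate processes run in one process here:

```python
import os
import tempfile

FSYNC_EVERY = 4  # fsync the double-write area every N pages (illustrative)

class DoubleWriteSpool:
    def __init__(self, path):
        self.f = open(path, "wb")
        self.pending = []       # pages written but not yet fsynced
        self.durable = set()    # pages whose DW copy is safely on disk

    def spool(self, page_no, data):
        """Dump a dirty page to the double-write area as soon as it's dirtied."""
        self.f.write(data)
        self.pending.append(page_no)
        if len(self.pending) >= FSYNC_EVERY:
            self.flush()

    def flush(self):
        self.f.flush()
        os.fsync(self.f.fileno())        # one fsync covers the whole batch
        self.durable.update(self.pending)
        self.pending.clear()

    def can_evict(self, page_no):
        """Only evict pages whose double-write copy is already durable."""
        return page_no in self.durable

spool = DoubleWriteSpool(os.path.join(tempfile.mkdtemp(), "dw"))
for p in range(6):
    spool.spool(p, b"\x00" * 8192)
print([spool.can_evict(p) for p in range(6)])
# pages 0-3 were fsynced as one batch; 4 and 5 are still pending
```

By the time eviction comes looking for pages 0 through 3, their in-place writes can proceed without waiting on an fsync; only the still-pending tail would have to wait, which is the case the grown-ring fallback is meant to avoid.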

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
