Re: double writes using "double-write buffer" approach [WIP]

From: Dan Scales <scales(at)vmware(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: double writes using "double-write buffer" approach [WIP]
Date: 2012-02-05 21:17:15
Message-ID: 1871024608.1144384.1328476635051.JavaMail.root@zimbra-prod-mbox-4.vmware.com
Lists: pgsql-hackers

Thanks for the detailed followup. I do see how Postgres is tuned for
having a bunch of memory available that is not in shared_buffers, both
for the OS buffer cache and other memory allocations. However, Postgres
seems to run fine in many "large shared_buffers" configurations that I
gave performance numbers for, including 5G shared_buffers for an 8G
machine, 3G shared_buffers for a 6G machine, etc. There just has to be
sufficient extra memory beyond the shared_buffers cache.

I think the pgbench run is pointing out a problem that this double_writes
implementation has with BULK_WRITEs. As you point out, the
BufferAccessStrategy for BULK_WRITEs will cause lots of dirty evictions.
I'm not sure if there is a great solution that always works for that
issue. However, I do notice that BULK_WRITE data isn't WAL-logged unless
archiving/replication is happening. As I understand it, if the
BULK_WRITE data isn't being WAL-logged, then it doesn't have to be
double-written either. The BULK_WRITE data is not officially synced and
committed until it is all written, so there doesn't have to be any
torn-page protection for that data, which is why the WAL logging can be
omitted. The double-write implementation can be improved by marking each
buffer that doesn't need torn-page protection. These buffers would be
those new pages that are explicitly not WAL-logged, even when
full_page_writes is enabled. When such a buffer is eventually synced
(perhaps because of an eviction), it would not be double-written. This
would often avoid double-writes for BULK_WRITE, etc., especially since
the administrator is often not archiving or doing replication when doing
bulk loads.
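
To make that concrete, here is a minimal standalone sketch of the idea
(not the actual patch; the flag name, helper functions, and the
simplified flush routine are all invented for illustration): the flush
path checks a per-buffer flag and skips the double-write for pages that
were written without WAL logging.

/*
 * Standalone sketch, not PostgreSQL code: a per-buffer flag lets the
 * flush path skip the double-write for pages whose torn-page protection
 * is unnecessary, e.g. bulk-loaded pages that were not WAL-logged.
 * BM_NO_TORN_PROTECTION and flush_buffer() are hypothetical names.
 */
#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE 8192
#define BM_DIRTY              0x01
#define BM_NO_TORN_PROTECTION 0x02   /* page written without WAL logging */

typedef struct BufDesc
{
    uint32_t flags;
    char     page[PAGE_SIZE];
} BufDesc;

/* Stand-ins for the real write paths. */
static void double_write_page(const BufDesc *buf)
{
    (void) buf;
    printf("double-write + fsync, then data file write\n");
}

static void data_file_write(const BufDesc *buf)
{
    (void) buf;
    printf("direct data file write\n");
}

/* Flush one dirty buffer, taking the cheap path when it is safe. */
static void flush_buffer(BufDesc *buf)
{
    if (!(buf->flags & BM_DIRTY))
        return;
    if (buf->flags & BM_NO_TORN_PROTECTION)
        data_file_write(buf);    /* a torn page is harmless: the load is not committed yet */
    else
        double_write_page(buf);  /* torn-page protection required */
    buf->flags &= ~BM_DIRTY;
}

int main(void)
{
    BufDesc bulk     = { BM_DIRTY | BM_NO_TORN_PROTECTION, {0} };
    BufDesc ordinary = { BM_DIRTY, {0} };

    flush_buffer(&bulk);       /* bulk-load page: skips the double-write */
    flush_buffer(&ordinary);   /* normal page: goes through the double-write buffer */
    return 0;
}

Presumably the flag would be set at the same places that currently
decide to skip WAL logging for the bulk write, and would stop mattering
once the relation has been fsync'd at the end of the load.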

However, overall, I think the idea is that double writes are an optional
optimization. The user would only turn it on in existing configurations
where it helps or only slightly hurts performance, and where greatly
reducing the size of the WAL logs is beneficial. It might also be
especially beneficial when there is a small amount of flash or other
fast storage on which the double-write files can be stored.

Thanks,

Dan

----- Original Message -----
From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Dan Scales" <scales(at)vmware(dot)com>
Cc: "PG Hackers" <pgsql-hackers(at)postgresql(dot)org>
Sent: Friday, February 3, 2012 1:48:54 PM
Subject: Re: [HACKERS] double writes using "double-write buffer" approach [WIP]

On Fri, Feb 3, 2012 at 3:14 PM, Dan Scales <scales(at)vmware(dot)com> wrote:
> Thanks for the feedback!  I think you make a good point about the small size of dirty data in the OS cache.  I think what you can say about this double-write patch is that it will not work well for configurations that have a small Postgres cache and a large OS cache, since every write from the Postgres cache requires double-writes and an fsync.

The general guidance for setting shared_buffers these days is 25% of
RAM up to a maximum of 8GB, so the configuration that you're
describing as not optimal for this patch is the one normally used when
running PostgreSQL. I've run across several cases where larger values
of shared_buffers are a huge win, because the entire working set can
then be accommodated in shared_buffers. But it's certainly not the
case that all working sets fit.

And in this case, I think that's beside the point anyway. I had
shared_buffers set to 8GB on a machine with much more memory than
that, but the database created by pgbench -i -s 10 is about 156 MB, so
the problem isn't that there is too little PostgreSQL cache available.
The entire database fits in shared_buffers, with most of it left
over. However, because of the BufferAccessStrategy stuff, pages start
to get forced out to the OS pretty quickly. Of course, we could
disable the BufferAccessStrategy stuff when double_writes is in use,
but bear in mind that the reason we have it in the first place is to
prevent cache-thrashing effects. It would be imprudent of us to throw
that out the window without replacing it with something else that
would provide similar protection. And even if we did, that would just
delay the day of reckoning. You'd be able to blast through and dirty
the entirety of shared_buffers at top speed, but then as soon as you
started replacing pages performance would slow to an utter crawl, just
as it did here, only you'd need a bigger scale factor to trigger the
problem.
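
To see the shape of the problem, here is a toy standalone simulation
(not PostgreSQL code; the ring size and page count are arbitrary) of a
small round-robin ring like the one a bulk-write BufferAccessStrategy
uses: once the ring wraps, every newly dirtied page forces an earlier
dirty buffer to be written out, even though shared_buffers as a whole
is nearly empty, and with double_writes on each forced write becomes a
double-write plus an fsync.

/*
 * Standalone simulation, not PostgreSQL code: a small ring of buffers
 * reused round-robin.  The ring size and page count are chosen
 * arbitrarily for the demo.
 */
#include <stdio.h>
#include <stdbool.h>

#define RING_SIZE      16     /* small, fixed set of buffers in the ring */
#define PAGES_TO_DIRTY 100

typedef struct { bool dirty; int page; } RingSlot;

int main(void)
{
    RingSlot ring[RING_SIZE] = {{false, 0}};
    int next = 0;
    int forced_writes = 0;

    for (int page = 0; page < PAGES_TO_DIRTY; page++)
    {
        RingSlot *slot = &ring[next];

        /* Reusing a dirty slot forces a write (a double-write + fsync
         * when double_writes is on) long before checkpoint time. */
        if (slot->dirty)
            forced_writes++;

        slot->page = page;
        slot->dirty = true;
        next = (next + 1) % RING_SIZE;
    }

    printf("dirtied %d pages, ring forced %d early writes\n",
           PAGES_TO_DIRTY, forced_writes);
    return 0;
}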

The more general point here is that there are MANY aspects of
PostgreSQL's design that assume that shared_buffers accounts for a
relatively small percentage of system memory. Here's another one: we
assume that backends that need temporary memory for sorts and hashes
(i.e. work_mem) can just allocate it from the OS. If we were to start
recommending setting shared_buffers to large percentages of the
available memory, we'd probably have to rethink that. Most likely,
we'd need some kind of in-core mechanism for allocating temporary
memory from the shared memory segment. And here's yet another one: we
assume that it is better to recycle old WAL files and overwrite the
contents rather than create new, empty ones, because we assume that
the pages from the old files may still be present in the OS cache. We
also rely on the fact that an evicted CLOG page can be pulled back in
quickly without (in most cases) a disk access. We also rely on
shared_buffers not being too large to avoid walloping the I/O
controller too hard at checkpoint time - which is forcing some people
to set shared_buffers much smaller than would otherwise be ideal. In
other words, even if setting shared_buffers to most of the available
system memory would fix the problem I mentioned, it would create a
whole bunch of new ones, many of them non-trivial. It may be a good
idea to think about what we'd need to do to work efficiently in that
sort of configuration, but there is going to be a very large amount of
thinking, testing, and engineering that has to be done to make it a
reality.

There's another issue here, too. The idea that we're going to write
data to the double-write buffer only when we decide to evict the pages
strikes me as a bad one. We ought to proactively start dumping pages
to the double-write area as soon as they're dirtied, and fsync them
after every N pages, so that by the time we need to evict some page
that requires a double-write, it's already durably on disk in the
double-write buffer, and we can do the real write without having to
wait. It's likely that, to make this perform acceptably for bulk
loads, you'll need the writes to the double-write buffer and the
fsyncs of that buffer to be done by separate processes, so that one
backend (the background writer, perhaps) can continue spooling
additional pages to the double-write files while some other process (a
new auxiliary process?) fsyncs the ones that are already full. Along
with that, the page replacement algorithm probably needs to be
adjusted to avoid, like the plague, evicting pages whose double-writes
are not yet finished, even to the extent of allowing the
BufferAccessStrategy rings to grow if the double-writes can't be
finished before the ring wraps around.
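
A standalone toy sketch of that batching idea (not a patch; the file
name, batch size, and single-process structure are simplifying
assumptions, whereas the proposal above splits the spooling and the
fsyncs across separate processes): pages are appended to the
double-write file as they are dirtied, the file is fsync'd every N
pages, and only pages covered by a completed fsync are eligible for the
real in-place write.

/*
 * Standalone sketch, not a patch: append pages to a double-write file
 * as they are dirtied, fsync every FSYNC_BATCH pages, and track how
 * much of the file is known durable.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SIZE   8192
#define FSYNC_BATCH 32            /* fsync the double-write file every N pages */

static int  dw_fd;                /* double-write file */
static int  pages_since_fsync;    /* pages written since the last fsync */
static long durable_high_water;   /* pages known durable in the DW file */
static long total_spooled;

/* Called as soon as a page is dirtied: append it to the DW file. */
static void dw_spool_page(const char *page)
{
    if (write(dw_fd, page, PAGE_SIZE) != PAGE_SIZE)
    {
        perror("write");
        exit(1);
    }
    total_spooled++;
    if (++pages_since_fsync >= FSYNC_BATCH)
    {
        /* In the proposal this fsync would run in a separate process so
         * the writer can keep spooling; here it is done inline. */
        if (fsync(dw_fd) != 0)
        {
            perror("fsync");
            exit(1);
        }
        durable_high_water = total_spooled;
        pages_since_fsync = 0;
    }
}

/* Eviction may do the real write only for pages already durable in the DW file. */
static int dw_page_is_durable(long page_seq)
{
    return page_seq < durable_high_water;
}

int main(void)
{
    char page[PAGE_SIZE];

    memset(page, 'x', sizeof(page));
    dw_fd = open("double_write.tmp", O_CREAT | O_TRUNC | O_WRONLY, 0600);
    if (dw_fd < 0)
    {
        perror("open");
        return 1;
    }

    for (long i = 0; i < 100; i++)
        dw_spool_page(page);

    printf("spooled %ld pages, %ld durable; page 10 evictable without waiting: %s\n",
           total_spooled, durable_high_water,
           dw_page_is_durable(10) ? "yes" : "no");

    close(dw_fd);
    unlink("double_write.tmp");
    return 0;
}

With the spooling and the fsyncs split across processes as described
above, durable_high_water would presumably advance continuously, so an
evicting backend should only rarely find its page not yet durable.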

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
