Re: double writes using "double-write buffer" approach [WIP]

From: Dan Scales <scales(at)vmware(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: double writes using "double-write buffer" approach [WIP]
Date: 2012-02-03 20:14:29
Message-ID: 1134198121.1082096.1328300069338.JavaMail.root@zimbra-prod-mbox-4.vmware.com
Lists: pgsql-hackers

Hi Robert,

Thanks for the feedback! I think you make a good point about the small size of dirty data in the OS cache. I think what you can say about this double-write patch is that it will not work well for configurations that have a small Postgres cache and a large OS cache, since every write from the Postgres cache requires double-writes and an fsync. However, it should work much better for configurations with a much larger Postgres cache and a relatively small OS cache (including the configurations that I've given performance results for). In that case, there is a lot more capacity for dirty pages in the Postgres cache, and you won't have nearly as many dirty evictions. The checkpointer is doing a good number of the writes, and this patch sorts the checkpointer's buffers so its IO is efficient.
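
For concreteness, here is a rough sketch of the kind of sorting I mean for the checkpointer's buffers -- just an illustration, with hypothetical type and function names, not the identifiers actually used in the patch:

/* Illustrative only -- hypothetical names, not the patch's code. */
#include <stdlib.h>
#include <stdint.h>

typedef struct DirtyBufEntry
{
    uint32_t    rel_node;   /* relation file the dirty page belongs to */
    uint32_t    block_num;  /* block offset within that file */
    int         buf_id;     /* shared-buffer index holding the page */
} DirtyBufEntry;

/* Order by relation first, then block, so the writes come out sequential. */
static int
dirty_buf_cmp(const void *a, const void *b)
{
    const DirtyBufEntry *x = (const DirtyBufEntry *) a;
    const DirtyBufEntry *y = (const DirtyBufEntry *) b;

    if (x->rel_node != y->rel_node)
        return (x->rel_node < y->rel_node) ? -1 : 1;
    if (x->block_num != y->block_num)
        return (x->block_num < y->block_num) ? -1 : 1;
    return 0;
}

/* Called once per checkpoint, before the writes are issued in order. */
static void
sort_checkpoint_buffers(DirtyBufEntry *entries, size_t n)
{
    qsort(entries, n, sizeof(DirtyBufEntry), dirty_buf_cmp);
}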

Of course, I can also make the non-checkpointer ring buffer much larger, though I wouldn't want to make it too large, since it consumes memory. If I increase the size of the ring buffers significantly, I will probably need to add some data structures so that the ring buffer lookups in smgrread() and smgrwrite() are more efficient.
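
As a rough illustration of what that could look like (again, hypothetical names and sizes, not the patch's actual code), the lookup could use a small open-addressing hash table keyed by (relation, block) instead of a linear scan of the ring buffer; since the whole table can be cleared when a batch has been written and fsync'd, there is no need for per-entry deletion:

/* Illustrative only -- hypothetical names, not the patch's code. */
#include <stdint.h>

#define DW_SLOTS        256                 /* assumed ring-buffer size */
#define DW_HASH_SIZE    (DW_SLOTS * 2)      /* power of two, kept half empty */
#define DW_EMPTY        (-1)

typedef struct DwTag
{
    uint32_t    rel_node;   /* relation file */
    uint32_t    block_num;  /* block within that file */
} DwTag;

static DwTag    dw_tags[DW_HASH_SIZE];
static int      dw_slots[DW_HASH_SIZE];     /* ring slot, or DW_EMPTY */

/* Must be called at startup, and again after each batch is flushed. */
static void
dw_clear(void)
{
    for (int i = 0; i < DW_HASH_SIZE; i++)
        dw_slots[i] = DW_EMPTY;
}

static uint32_t
dw_hash(DwTag tag)
{
    /* simple mixing of the two key fields; any reasonable hash would do */
    return (tag.rel_node * 2654435761u ^ tag.block_num * 40503u) &
           (DW_HASH_SIZE - 1);
}

/* Record that ring-buffer slot 'slot' currently holds this page. */
static void
dw_insert(DwTag tag, int slot)
{
    uint32_t    i = dw_hash(tag);

    while (dw_slots[i] != DW_EMPTY)
        i = (i + 1) & (DW_HASH_SIZE - 1);
    dw_tags[i] = tag;
    dw_slots[i] = slot;
}

/* smgrread()/smgrwrite() would call this; returns the slot or DW_EMPTY. */
static int
dw_lookup(DwTag tag)
{
    uint32_t    i = dw_hash(tag);

    while (dw_slots[i] != DW_EMPTY)
    {
        if (dw_tags[i].rel_node == tag.rel_node &&
            dw_tags[i].block_num == tag.block_num)
            return dw_slots[i];
        i = (i + 1) & (DW_HASH_SIZE - 1);
    }
    return DW_EMPTY;
}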

Can you let me know what the shared_buffers and RAM sizes were for your pgbench run? I can try running the same workload. If the size of shared_buffers is especially small compared to RAM, then we should increase the size of shared_buffers when using double_writes.

Thanks,

Dan

----- Original Message -----
From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Dan Scales" <scales(at)vmware(dot)com>
Cc: "PG Hackers" <pgsql-hackers(at)postgresql(dot)org>
Sent: Thursday, February 2, 2012 7:19:47 AM
Subject: Re: [HACKERS] double writes using "double-write buffer" approach [WIP]

On Fri, Jan 27, 2012 at 5:31 PM, Dan Scales <scales(at)vmware(dot)com> wrote:
> I've been prototyping the double-write buffer idea that Heikki and Simon
> had proposed (as an alternative to a previous patch that only batched up
> writes by the checkpointer).  I think it is a good idea, and can help
> double-writes perform better in the case of lots of backend evictions.
> It also centralizes most of the code change in smgr.c.  However, it is
> trickier to reason about.

This doesn't compile on MacOS X, because there's no writev().

I don't understand how you can possibly get away with such small
buffers. AIUI, you must retain every page in the double-write
buffer until it's been written and fsync'd to disk. That means the
most dirty data you'll ever be able to have in the operating system
cache with this implementation is (128 + 64) * 8kB = 1.5MB. Granted,
we currently have occasional problems with the OS caching too *much*
dirty data, but that seems like it's going way, way too far in the
opposite direction. That's barely enough for the system to do any
write reordering at all.

I am particularly worried about what happens when a ring buffer is in
use. I tried running "pgbench -i -s 10" with this patch applied,
full_page_writes=off, double_writes=on. It took 41.2 seconds to
complete. The same test with the stock code takes 14.3 seconds; and
the actual situation is worse for double-writes than those numbers
might imply, because the index build time doesn't seem to be much
affected, while the COPY takes a small eternity with the patch
compared to the usual way of doing things. I think the slowdown on
COPY once the double-write buffer fills is on the order of 10x.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
