
double writes using "double-write buffer" approach [WIP]

From: Dan Scales <scales(at)vmware(dot)com>
To: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: double writes using "double-write buffer" approach [WIP]
Date: 2012-01-27 22:31:54
Message-ID: 1962493974.656458.1327703514780.JavaMail.root@zimbra-prod-mbox-4.vmware.com
Lists: pgsql-hackers
I've been prototyping the double-write buffer idea that Heikki and Simon
had proposed (as an alternative to a previous patch that only batched up
writes by the checkpointer).  I think it is a good idea, and can help
double-writes perform better in the case of lots of backend evictions.
It also centralizes most of the code change in smgr.c.  However, it is
trickier to reason about.

The idea is that all page writes generally are copied to a double-write
buffer, rather than being immediately written.  Note that a full copy of
the page is required, but it can be folded in with a checksum calculation.
Periodically (e.g. every time a certain-size batch of writes has been
added), some writes are pushed out using double writes -- the pages are
first written and fsynced to a double-write file, then written to the
data files, which are then fsynced.  The double writes then allow torn
pages to be fixed, so full_page_writes can be turned off (thus greatly
reducing the size of the WAL log).
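The copy-plus-checksum fold mentioned above can be sketched like this (a
simplified illustration only -- the function name, page-size constant, and
the rotate-XOR checksum are stand-ins, not the patch's actual code, which
would use PostgreSQL's page layout):

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 8192

/* Copy a page into a slot of the double-write buffer and fold a
 * simple checksum into the same pass over the bytes, so the extra
 * copy adds little cost beyond the checksum we want anyway. */
static uint32_t
copy_page_with_checksum(char *dw_slot, const char *page)
{
    uint32_t sum = 0;
    int      i;

    for (i = 0; i < PAGE_SIZE; i++)
    {
        dw_slot[i] = page[i];
        /* rotate-left by 1, then XOR in the next byte */
        sum = (sum << 1 | sum >> 31) ^ (uint8_t) page[i];
    }
    return sum;
}
```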

The key changes are conceptually simple:

1.  In smgrwrite(), copy the page to the double-write buffer.  If a big
    enough batch has accumulated, then flush the batch using double
    writes.  [I don't think I need to intercept calls to smgrextend(),
    but I am not totally sure.]

2.  In smgrread(), always look first in the double-write buffer for a
    particular page, before going to disk.

3.  At the end of a checkpoint and on shutdown, always make sure that the
    current contents of the double-write buffer are flushed.

4.  Pass flags around in some cases to indicate whether a page buffer
    needs a double write or not.  (I think eventually this would be an
    attribute of the buffer, set when the page is WAL-logged, rather than
    a flag passed around.)

5.  Deal with duplicates in the double-write buffer appropriately (very
    rarely happens).
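As a rough illustration of step 2 -- reads consulting the double-write
buffer before going to disk -- here is a minimal sketch (hypothetical
names and a linear scan over the slots; the real code in smgr.c would key
on the actual relation/fork/block tag):

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define PAGE_SIZE 8192
#define DW_SLOTS  128

/* One slot of an in-memory double-write buffer; rel and blockno
 * are stand-ins for the real page identifier. */
typedef struct
{
    uint32_t rel;
    uint32_t blockno;
    bool     valid;
    char     data[PAGE_SIZE];
} DwSlot;

static DwSlot dw_buffer[DW_SLOTS];

/* smgrread() must serve a page from the double-write buffer if a
 * newer copy is sitting there, and only otherwise fall through to
 * the data file.  Returns true on a buffer hit. */
static bool
dw_lookup(uint32_t rel, uint32_t blockno, char *dest)
{
    int i;

    for (i = 0; i < DW_SLOTS; i++)
    {
        if (dw_buffer[i].valid &&
            dw_buffer[i].rel == rel &&
            dw_buffer[i].blockno == blockno)
        {
            memcpy(dest, dw_buffer[i].data, PAGE_SIZE);
            return true;
        }
    }
    return false;               /* caller reads from disk instead */
}
```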

To get good performance, I needed to have two double-write buffers, one
for the checkpointer and one for all other processes.  The double-write
buffers are circular buffers.  The checkpointer double-write buffer is
just a single batch of 64 pages; the non-checkpointer double-write buffer
is 128 pages, 2 batches of 64 pages each.  Each batch goes to a different
double-write file, so that they can be issued independently as soon as
each batch is completed.  Also, I need to sort the buffers being
checkpointed by file/offset (see ioseq.c), so that the checkpointer
batches will most likely only have to write and fsync one data file.
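The file/offset sort can be sketched with a plain qsort() comparator (a
simplified stand-in for ioseq.c; the struct and field names here are
hypothetical):

```c
#include <stdlib.h>
#include <stdint.h>

/* A dirty page queued for a double-write batch, identified by the
 * data file it belongs to and its offset within that file. */
typedef struct
{
    uint32_t fileno;
    uint32_t offset;
} PendingWrite;

/* Order pending writes by (file, offset) so a checkpointer batch
 * usually writes one data file sequentially and fsyncs it once. */
static int
pending_cmp(const void *a, const void *b)
{
    const PendingWrite *x = a;
    const PendingWrite *y = b;

    if (x->fileno != y->fileno)
        return x->fileno < y->fileno ? -1 : 1;
    if (x->offset != y->offset)
        return x->offset < y->offset ? -1 : 1;
    return 0;
}

static void
sort_batch(PendingWrite *writes, size_t n)
{
    qsort(writes, n, sizeof(PendingWrite), pending_cmp);
}
```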

Interestingly, I find that the plot of tpm for DBT2 is much smoother
(though still has wiggles) with double writes enabled, since there are no
unpredictable long fsyncs at the end (or during) a checkpoint.

Here are performance numbers for double-write buffer (same configs as
previous numbers), for 2-processor, 60-minute 50-warehouse DBT2.  On the
right are the size of shared_buffers and the size of the RAM in the
virtual machine.  FPW stands for full_page_writes, DW for
double_writes.  'two disk' means the WAL log is on a separate ext3
filesystem from the data files.

           FPW off FPW on  DW on, FPW off
one disk:  15488   13146   11713                    [5G buffers, 8G VM]
two disk:  18833   16703   18013

one disk:  12908   11159    9758                    [3G buffers, 6G VM]
two disk:  14258   12694   11229

one disk:  10829    9865    5806                    [1G buffers, 8G VM]
two disk:  13605   12694    5682

one disk:   6752    6129    4878                    [1G buffers, 2G VM]
two disk:   7253    6677    5239


The performance of DW on the small cache cases (1G shared_buffers) is now
much better, though still not as good as FPW on.  In the medium cache
case (3G buffers), where there are significant backend dirty evictions,
the performance of DW is close to that of FPW on.  In the large cache (5G
buffers), where the checkpointer can do all the work and there are
minimal dirty evictions, DW is much better than FPW in the two disk case.
In the one disk case, it is somewhat worse than FPW.  However,
interestingly, if you just move the double-write files to a separate ext3
filesystem on the same disk as the data files, the performance goes to
13107 -- now on par with FPW on.  We are obviously getting hit by the
ext3 fsync slowness issues.  (I believe that an fsync on a filesystem can
stall on other unrelated writes to the same filesystem.)

Let me know if you have any thoughts/comments, etc.  The patch is
enclosed, and the README.doublewrites is updated a fair bit.

Thanks,

Dan

Attachment: dwbuf2.patch
Description: text/x-patch (88.1 KB)
