
Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Marti Raudsepp <marti(at)juffo(dot)org>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?
Date: 2010-11-16 20:39:20
Message-ID: 4CE2EBF8.4040602@2ndquadrant.com
Lists: pgsql-performance
Time for a deeper look at what's going on here...I installed RHEL6 Beta 
2 yesterday, on the presumption that since the release version just came 
out this week, it was likely the same version Marti tested against.  
Also, it was the one I already had a DVD to install from.  This was on a 
laptop with a 7200 RPM hard drive, already containing an Ubuntu 
installation for comparison's sake.

Initial testing was done with the PostgreSQL test_fsync utility, just to 
get a rough idea of which configurations were plausibly flushing data to 
disk correctly, and which were reporting rates that can't possibly be 
true.  7200 RPM = 120 rotations/second, which puts an upper limit of 120 
true fsync executions per second on this drive.  The test_fsync released 
with PostgreSQL 9.0 now reports its results on the right scale to 
compare directly against that figure (earlier versions reported 
seconds/commit, not commits/second).
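The measurement test_fsync makes can be sketched outside the C utility; this is a simplified Python analogue (the file path and function names here are mine, not part of test_fsync), timing repeated 8k write-plus-fsync cycles against the rotation ceiling:

```python
import os
import time

def rpm_fsync_ceiling(rpm):
    """A spinning disk can complete at most one true flush per platter
    rotation, so the ceiling is rpm/60 fsyncs per second."""
    return rpm / 60.0

def fsync_rate(path, loops=50, block_size=8192):
    """Rough analogue of test_fsync's "8k write, fsync" case: rewrite
    the same 8k block and fsync it, reporting commits/second."""
    buf = b"\0" * block_size
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(loops):
            os.lseek(fd, 0, os.SEEK_SET)
            os.write(fd, buf)
            os.fsync(fd)
        return loops / (time.perf_counter() - start)
    finally:
        os.close(fd)
        os.unlink(path)

print(rpm_fsync_ceiling(7200))                  # 120.0
print(fsync_rate("/tmp/fsync_sketch.out"))      # hardware-dependent
```

On a setup that honors flushes, the second number should land at or below the first; a result far above it means a volatile write cache is absorbing the "flush".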

First I built test_fsync from inside of an existing PostgreSQL 9.1 HEAD 
checkout:

$ cd [PostgreSQL source code tree]
$ cd src/tools/fsync/
$ make

And I started with looking at the Ubuntu system running ext3, which 
represents the status quo we've been seeing the past few years.  
Initially the drive write cache was turned on:

$ uname -a
Linux meddle 2.6.28-19-generic #61-Ubuntu SMP Wed May 26 23:35:15 UTC 
2010 i686 GNU/Linux
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=9.04
DISTRIB_CODENAME=jaunty
DISTRIB_DESCRIPTION="Ubuntu 9.04"

/dev/sda5 on / type ext3 (rw,relatime,errors=remount-ro)

$ ./test_fsync
Loops = 10000

Simple write:
    8k write                      88476.784/second

Compare file sync methods using one write:
    (unavailable: open_datasync)
    open_sync 8k write             1192.135/second
    8k write, fdatasync            1222.158/second
    8k write, fsync                1097.980/second

Compare file sync methods using two writes:
    (unavailable: open_datasync)
    2 open_sync 8k writes           527.361/second
    8k write, 8k write, fdatasync  1105.204/second
    8k write, 8k write, fsync      1084.050/second

Compare open_sync with different sizes:
    open_sync 16k write             966.047/second
    2 open_sync 8k writes           529.565/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
    8k write, fsync, close         1064.177/second
    8k write, close, fsync         1042.337/second

Two notable things here.  One, there is no open_datasync defined in this 
older kernel.  Two, all methods of commit give equally inflated commit 
rates, far faster than the drive is capable of.  This proves this setup 
isn't flushing the drive's write cache after commit.
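That sanity check reduces to a one-line rule of thumb (a hypothetical helper, not part of test_fsync): any measured commit rate well above one flush per platter rotation can't be real.

```python
def flushes_look_fake(commits_per_sec, rpm):
    """True if the measured commit rate exceeds the physical ceiling of
    one flush per rotation (rpm/60), meaning a volatile write cache
    must be absorbing the flushes."""
    return commits_per_sec > rpm / 60.0

# The Ubuntu ext3 numbers above, against a 7200 RPM drive:
print(flushes_look_fake(1222.158, 7200))   # True: the cache is lying
print(flushes_look_fake(108.106, 7200))    # False: plausibly real flushes
```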

You can get safe behavior out of the old kernel by disabling its write 
cache:

$ sudo /sbin/hdparm -W0 /dev/sda

/dev/sda:
 setting drive write-caching to 0 (off)
 write-caching =  0 (off)

$ ./test_fsync
Loops = 10000

Simple write:
    8k write                      89023.413/second

Compare file sync methods using one write:
    (unavailable: open_datasync)
    open_sync 8k write              106.968/second
    8k write, fdatasync             108.106/second
    8k write, fsync                 104.238/second

Compare file sync methods using two writes:
    (unavailable: open_datasync)
    2 open_sync 8k writes            51.637/second
    8k write, 8k write, fdatasync   109.256/second
    8k write, 8k write, fsync       103.952/second

Compare open_sync with different sizes:
    open_sync 16k write             109.562/second
    2 open_sync 8k writes            52.752/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
    8k write, fsync, close          107.179/second
    8k write, close, fsync          106.923/second

And now results are as expected:  just under 120/second.

Onto RHEL6.  Setup for this initial test was:

$ uname -a
Linux meddle 2.6.32-44.1.el6.x86_64 #1 SMP Wed Jul 14 18:51:29 EDT 2010 
x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.0 Beta (Santiago)
$ mount
/dev/sda7 on / type ext4 (rw)

And I started with the write cache off to see a straight comparison 
against the above:

$ sudo hdparm -W0 /dev/sda

/dev/sda:
 setting drive write-caching to 0 (off)
 write-caching =  0 (off)
$ ./test_fsync
Loops = 10000

Simple write:
    8k write                      104194.886/second

Compare file sync methods using one write:
    open_datasync 8k write           97.828/second
    open_sync 8k write              109.158/second
    8k write, fdatasync             109.838/second
    8k write, fsync                  20.872/second

Compare file sync methods using two writes:
    2 open_datasync 8k writes        53.902/second
    2 open_sync 8k writes            53.721/second
    8k write, 8k write, fdatasync   109.731/second
    8k write, 8k write, fsync        20.918/second

Compare open_sync with different sizes:
    open_sync 16k write             109.552/second
    2 open_sync 8k writes            54.116/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
    8k write, fsync, close           20.800/second
    8k write, close, fsync           20.868/second

A few changes then.  open_datasync is available now.  It looks slightly 
slower than the alternatives in this test, but I didn't see that in the 
later tests, so I'm inclined to chalk it up to run-to-run variation.  
For some reason regular fsync is dramatically slower in this kernel than 
in earlier ones.  Perhaps a lot more metadata is being flushed all the 
way to the disk in that case now?

The issue that I think Marti has been concerned about is highlighted in 
this interesting subset of the data:

Compare file sync methods using two writes:
    2 open_datasync 8k writes        53.902/second
    8k write, 8k write, fdatasync   109.731/second

The results here aren't surprising; if you do two dsync writes, that 
will take two disk rotations, while two writes followed by a single sync 
only take one.  But that does mean that with small values of 
wal_buffers, like the default, you could easily end up paying a rotation 
sync penalty more than once per commit.

Next question is what happens if I turn the drive's write cache back on:

$ sudo hdparm -W1 /dev/sda

/dev/sda:
 setting drive write-caching to 1 (on)
 write-caching =  1 (on)

$ ./test_fsync
Loops = 10000

Simple write:
    8k write                      104198.143/second

Compare file sync methods using one write:
    open_datasync 8k write          110.707/second
    open_sync 8k write              110.875/second
    8k write, fdatasync             110.794/second
    8k write, fsync                  28.872/second

Compare file sync methods using two writes:
    2 open_datasync 8k writes        55.731/second
    2 open_sync 8k writes            55.618/second
    8k write, 8k write, fdatasync   110.551/second
    8k write, 8k write, fsync        28.843/second

Compare open_sync with different sizes:
    open_sync 16k write             110.176/second
    2 open_sync 8k writes            55.785/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
    8k write, fsync, close           28.779/second
    8k write, close, fsync           28.855/second

This is nice to see from a reliability perspective.  On all three of the 
viable sync methods here, the speed seen suggests the drive's volatile 
write cache is being flushed after every commit.  This is going to be 
bad for people who have gotten used to doing development on systems 
where that flush isn't honored and don't care about it, because this 
looks like a 90% drop in performance on those systems.  But since the 
new behavior is safe and the earlier one was not, it's hard to get mad 
about it.  Developers probably just need to be taught to turn 
synchronous_commit off to speed things up when playing with test data.
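For anyone in that situation, the setting can be flipped per-session or globally; a minimal example (standard PostgreSQL syntax, shown here just for illustration):

```sql
-- Per-session, e.g. while loading test data:
SET synchronous_commit = off;

-- Or for everything, in postgresql.conf:
-- synchronous_commit = off
```

Unlike fsync = off, this only risks losing the last few commits after a crash, not corrupting the database.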

test_fsync writes to /var/tmp/test_fsync.out by default, not paying 
attention to what directory you're in.  So to use it to test another 
filesystem, you have to make sure to give it an explicit full path.  
Next I tested against the old Ubuntu partition that was formatted with 
ext3, with the write cache still on:

# mount | grep /ext3
/dev/sda5 on /ext3 type ext3 (rw)
# ./test_fsync -f /ext3/test_fsync.out
Loops = 10000

Simple write:
    8k write                      100943.825/second

Compare file sync methods using one write:
    open_datasync 8k write          106.017/second
    open_sync 8k write              108.318/second
    8k write, fdatasync             108.115/second
    8k write, fsync                 105.270/second

Compare file sync methods using two writes:
    2 open_datasync 8k writes        53.313/second
    2 open_sync 8k writes            54.045/second
    8k write, 8k write, fdatasync    55.291/second
    8k write, 8k write, fsync        53.243/second

Compare open_sync with different sizes:
    open_sync 16k write              54.980/second
    2 open_sync 8k writes            53.563/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
    8k write, fsync, close          105.032/second
    8k write, close, fsync          103.987/second

Strange...it looks like ext3 is executing cache flushes, too.  Note that 
all of the "Compare file sync methods using two writes" results are half 
speed now; it's as if ext3 is flushing the first write out immediately?  
This result was unexpected, and I don't trust it yet; I want to validate 
this elsewhere.

What about XFS?  That's a first class filesystem on RHEL6 too:

[root(at)meddle fsync]# ./test_fsync -f /xfs/test_fsync.out
Loops = 10000

Simple write:
    8k write                      71878.324/second

Compare file sync methods using one write:
    open_datasync 8k write           36.303/second
    open_sync 8k write               35.714/second
    8k write, fdatasync              35.985/second
    8k write, fsync                  35.446/second

I stopped there, sick of waiting for it, as there's obviously some 
serious work (mount options or the like, at a minimum) needed before XFS 
matches the other two.  Will return to that later.

So, what have we learned so far:

1) On these newer kernels, both ext4 and ext3 seem to be pushing data 
out through the drive write caches correctly.

2) On single writes, there's no performance difference among the three 
main methods you might use (open_datasync, open_sync, fdatasync), while 
plain fsync has a serious regression in this use case.

3) WAL writes that are forced by wal_buffers filling will turn into a 
commit-length write when using the new, default open_datasync.  Using 
the older default of fdatasync avoids that problem, in return for 
causing WAL writes to pollute the OS cache.  The main benefit of O_DSYNC 
writes over fdatasync ones is avoiding the OS cache.

I want to next go through and replicate some of the actual database 
level tests before giving a full opinion on whether this data proves 
it's worth changing the wal_sync_method detection.  So far I'm torn 
between whether that's the right approach, or if we should just increase 
the default value for wal_buffers to something more reasonable.

-- 
Greg Smith   2ndQuadrant US    greg(at)2ndQuadrant(dot)com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

