Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From: Scott Carey <scott(at)richrelevance(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Marti Raudsepp <marti(at)juffo(dot)org>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?
Date: 2010-11-17 20:19:10
Message-ID: 499BFB3F-CD9F-47D2-87E4-058C2A6D63EC@richrelevance.com
Lists: pgsql-performance


On Nov 16, 2010, at 12:39 PM, Greg Smith wrote:
>
> $ ./test_fsync
> Loops = 10000
>
> Simple write:
> 8k write 88476.784/second
>
> Compare file sync methods using one write:
> (unavailable: open_datasync)
> open_sync 8k write 1192.135/second
> 8k write, fdatasync 1222.158/second
> 8k write, fsync 1097.980/second
>
> Compare file sync methods using two writes:
> (unavailable: open_datasync)
> 2 open_sync 8k writes 527.361/second
> 8k write, 8k write, fdatasync 1105.204/second
> 8k write, 8k write, fsync 1084.050/second
>
> Compare open_sync with different sizes:
> open_sync 16k write 966.047/second
> 2 open_sync 8k writes 529.565/second
>
> Test if fsync on non-write file descriptor is honored:
> (If the times are similar, fsync() can sync data written
> on a different descriptor.)
> 8k write, fsync, close 1064.177/second
> 8k write, close, fsync 1042.337/second
>
> Two notable things here. One, there is no open_datasync defined in this
> older kernel. Two, all methods of commit give equally inflated commit
> rates, far faster than the drive is capable of. This proves this setup
> isn't flushing the drive's write cache after commit.
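
A quick sanity check on "far faster than the drive is capable of": a commit that truly reaches the platter can complete at most once per disk rotation, so a 7200 RPM drive (an assumption; the model isn't stated) tops out at 7200/60 = 120 commits/second. ~1200/second is ten times that, so the flushes are plainly being absorbed by the write cache.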

Nit: there is no open_sync, only open_dsync. Prior to recent kernels, only (semantically) open_dsync exists, labeled as open_sync. Newer kernels move that code to open_datasync and have a NEW open_sync that supposedly flushes metadata properly.
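
To make the distinction concrete, here is roughly what the three viable methods reduce to at the syscall level (a sketch, not PostgreSQL's actual code; "wal" is a placeholder path):

#include <fcntl.h>
#include <unistd.h>

static char buf[8192];

int main(void)
{
    /* open_datasync: O_DSYNC makes each write() act like
       write()+fdatasync(); the data is durable before write() returns. */
    int fd = open("wal", O_WRONLY | O_CREAT | O_DSYNC, 0600);
    write(fd, buf, sizeof buf);
    close(fd);

    /* fdatasync: plain buffered write, then flush the file data
       (but not metadata such as mtime). */
    fd = open("wal", O_WRONLY);
    write(fd, buf, sizeof buf);
    fdatasync(fd);

    /* fsync: flushes data *and* metadata, which is why it becomes the
       slow one on a filesystem that honors it. */
    write(fd, buf, sizeof buf);
    fsync(fd);
    close(fd);
    return 0;
}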

>
> You can get safe behavior out of the old kernel by disabling its write
> cache:
>
> $ sudo /sbin/hdparm -W0 /dev/sda
>
> /dev/sda:
> setting drive write-caching to 0 (off)
> write-caching = 0 (off)
>
> Loops = 10000
>
> Simple write:
> 8k write 89023.413/second
>
> Compare file sync methods using one write:
> (unavailable: open_datasync)
> open_sync 8k write 106.968/second
> 8k write, fdatasync 108.106/second
> 8k write, fsync 104.238/second
>
> Compare file sync methods using two writes:
> (unavailable: open_datasync)
> 2 open_sync 8k writes 51.637/second
> 8k write, 8k write, fdatasync 109.256/second
> 8k write, 8k write, fsync 103.952/second
>
> Compare open_sync with different sizes:
> open_sync 16k write 109.562/second
> 2 open_sync 8k writes 52.752/second
>
> Test if fsync on non-write file descriptor is honored:
> (If the times are similar, fsync() can sync data written
> on a different descriptor.)
> 8k write, fsync, close 107.179/second
> 8k write, close, fsync 106.923/second
>
> And now results are as expected: just under 120/second.
>
> Onto RHEL6. Setup for this initial test was:
>
> $ uname -a
> Linux meddle 2.6.32-44.1.el6.x86_64 #1 SMP Wed Jul 14 18:51:29 EDT 2010
> x86_64 x86_64 x86_64 GNU/Linux
> $ cat /etc/redhat-release
> Red Hat Enterprise Linux Server release 6.0 Beta (Santiago)
> $ mount
> /dev/sda7 on / type ext4 (rw)
>
> And I started with the write cache off to see a straight comparison
> against the above:
>
> $ sudo hdparm -W0 /dev/sda
>
> /dev/sda:
> setting drive write-caching to 0 (off)
> write-caching = 0 (off)
> $ ./test_fsync
> Loops = 10000
>
> Simple write:
> 8k write 104194.886/second
>
> Compare file sync methods using one write:
> open_datasync 8k write 97.828/second
> open_sync 8k write 109.158/second
> 8k write, fdatasync 109.838/second
> 8k write, fsync 20.872/second

fsync is working now! Flushing metadata properly reduces performance.
However, shouldn't open_sync slow down versus open_datasync too, and end up similar to fsync?

Did you recompile your test on the RHEL6 system?
Code compiled on newer kernels sees O_DSYNC and O_SYNC as two separate sentinel values, let's call them 1 and 2 respectively. Code compiled against earlier kernels sees both O_DSYNC and O_SYNC as the same value, 1. So code compiled against an older kernel, asking for O_SYNC on a newer kernel, will actually get O_DSYNC behavior! This was intentional. I can't find the link to the mail, but it was Linus' idea to let old code that expected the 'faster but incorrect' behavior keep it on newer kernels. Only a recompile against newer header files will trigger the new behavior and expose the 'correct' open_sync behavior.
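
A tiny compile-time check shows which situation a given build is in (a sketch; the actual flag values vary by architecture):

#include <fcntl.h>
#include <stdio.h>

int main(void)
{
#if defined(O_DSYNC) && (O_DSYNC != O_SYNC)
    /* new headers: distinct flags, the "real" open_sync exists */
    printf("O_SYNC=0x%x O_DSYNC=0x%x (distinct)\n", O_SYNC, O_DSYNC);
#else
    /* old headers: O_SYNC is just another name for O_DSYNC, so a binary
       built here gets O_DSYNC behavior even on a newer kernel */
    printf("O_SYNC == O_DSYNC == 0x%x\n", O_SYNC);
#endif
    return 0;
}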

This will be 'fun' for postgres packagers and users -- data reliability behavior differs based on which kernel the binary was compiled against. Luckily, the xlogs only need open_datasync semantics.

>
> Compare file sync methods using two writes:
> 2 open_datasync 8k writes 53.902/second
> 2 open_sync 8k writes 53.721/second
> 8k write, 8k write, fdatasync 109.731/second
> 8k write, 8k write, fsync 20.918/second
>
> Compare open_sync with different sizes:
> open_sync 16k write 109.552/second
> 2 open_sync 8k writes 54.116/second
>
> Test if fsync on non-write file descriptor is honored:
> (If the times are similar, fsync() can sync data written
> on a different descriptor.)
> 8k write, fsync, close 20.800/second
> 8k write, close, fsync 20.868/second
>
> A few changes then. open_datasync is available now.

Again, noting the detail: it is open_sync that is new (depending on where the binary was compiled). The old open_sync has been relabeled as the new open_datasync.

> It looks slightly
> slower than the alternatives on this test, but I didn't see that on the
> later tests so I'm thinking that's just occasional run to run
> variation. For some reason regular fsync is dramatically slower in this
> kernel than earlier ones. Perhaps a lot more metadata being flushed all
> the way to the disk in that case now?
>
> The issue that I think Marti has been concerned about is highlighted in
> this interesting subset of the data:
>
> Compare file sync methods using two writes:
> 2 open_datasync 8k writes 53.902/second
> 8k write, 8k write, fdatasync 109.731/second
>
> The results here aren't surprising; if you do two dsync writes, that
> will take two disk rotations, while two writes followed a single sync
> only takes one. But that does mean that in the case of small values for
> wal_buffers, like the default, you could easily end up paying a rotation
> sync penalty more than once per commit.
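
(That matches the rotation math above: two syncs per commit halve the ceiling to 120/2 = 60/second, and ~54/second is what the runs show.)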
>
> Next question is what happens if I turn the drive's write cache back on:
>
> $ sudo hdparm -W1 /dev/sda
>
> /dev/sda:
> setting drive write-caching to 1 (on)
> write-caching = 1 (on)
>
> $ ./test_fsync
>
> [gsmith(at)meddle fsync]$ ./test_fsync
> Loops = 10000
>
> Simple write:
> 8k write 104198.143/second
>
> Compare file sync methods using one write:
> open_datasync 8k write 110.707/second
> open_sync 8k write 110.875/second
> 8k write, fdatasync 110.794/second
> 8k write, fsync 28.872/second
>
> Compare file sync methods using two writes:
> 2 open_datasync 8k writes 55.731/second
> 2 open_sync 8k writes 55.618/second
> 8k write, 8k write, fdatasync 110.551/second
> 8k write, 8k write, fsync 28.843/second
>
> Compare open_sync with different sizes:
> open_sync 16k write 110.176/second
> 2 open_sync 8k writes 55.785/second
>
> Test if fsync on non-write file descriptor is honored:
> (If the times are similar, fsync() can sync data written
> on a different descriptor.)
> 8k write, fsync, close 28.779/second
> 8k write, close, fsync 28.855/second
>
> This is nice to see from a reliability perspective. On all three of the
> viable sync methods here, the speed seen suggests the drive's volatile
> write cache is being flushed after every commit. This is going to be
> bad for people who have gotten used to doing development on systems
> where that's not honored and they don't care, because this looks like a
> 90% drop in performance on those systems.
> But since the new behavior is
> safe and the earlier one was not, it's hard to get mad about it.

I would love to see the same tests in this detail for RHEL 5.5 (which has ext3, ext4, and xfs). I think the data reliability issue that requires turning off the write cache arrived in roughly the 2.6.26 to 2.6.31 kernel range. Ubuntu doesn't really care about this stuff, which is one reason I avoid it for a production db. I know that xfs with the right settings on RHEL 5.5 does not require disabling the write cache.
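
(The "right settings" being write barriers, I believe: XFS mounts with barriers on by default, while ext3 has to be asked with "mount -o barrier=1". With barriers working, the kernel issues real cache-flush commands to the drive, so the write cache can stay enabled safely.)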

> Developers probably just need to be taught to turn synchronous_commit
> off to speed things up when playing with test data.
>

Absolutely.
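
For a dev box that's a one-line change, either globally or per session:

    synchronous_commit = off          # in postgresql.conf

    SET synchronous_commit = off;     -- or just for the current session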

> test_fsync writes to /var/tmp/test_fsync.out by default, not paying
> attention to what directory you're in. So to use it to test another
> filesystem, you have to make sure to give it an explicit full path.
> Next I tested against the old Ubuntu partition that was formatted with
> ext3, with the write cache still on:
>
> # mount | grep /ext3
> /dev/sda5 on /ext3 type ext3 (rw)
> # ./test_fsync -f /ext3/test_fsync.out
> Loops = 10000
>
> Simple write:
> 8k write 100943.825/second
>
> Compare file sync methods using one write:
> open_datasync 8k write 106.017/second
> open_sync 8k write 108.318/second
> 8k write, fdatasync 108.115/second
> 8k write, fsync 105.270/second
>
> Compare file sync methods using two writes:
> 2 open_datasync 8k writes 53.313/second
> 2 open_sync 8k writes 54.045/second
> 8k write, 8k write, fdatasync 55.291/second
> 8k write, 8k write, fsync 53.243/second
>
> Compare open_sync with different sizes:
> open_sync 16k write 54.980/second
> 2 open_sync 8k writes 53.563/second
>
> Test if fsync on non-write file descriptor is honored:
> (If the times are similar, fsync() can sync data written
> on a different descriptor.)
> 8k write, fsync, close 105.032/second
> 8k write, close, fsync 103.987/second
>
> Strange...it looks like ext3 is executing cache flushes, too. Note that
> all of the "Compare file sync methods using two writes" results are half
> speed now; it's as if ext3 is flushing the first write out immediately?
> This result was unexpected, and I don't trust it yet; I want to validate
> this elsewhere.
>
> What about XFS? That's a first class filesystem on RHEL6 too:
and available on later RHEL 5 releases.
>
> [root(at)meddle fsync]# ./test_fsync -f /xfs/test_fsync.out
> Loops = 10000
>
> Simple write:
> 8k write 71878.324/second
>
> Compare file sync methods using one write:
> open_datasync 8k write 36.303/second
> open_sync 8k write 35.714/second
> 8k write, fdatasync 35.985/second
> 8k write, fsync 35.446/second
>
> I stopped that there, sick of waiting for it, as there's obviously some
> serious work (mounting options or such at a minimum) that needs to be
> done before XFS matches the other two. Will return to that later.
>

Yes, XFS requires some fiddling. Its metadata operations are also very slow.

> So, what have we learned so far:
>
> 1) On these newer kernels, both ext4 and ext3 seem to be pushing data
> out through the drive write caches correctly.
>

I suspect that some older kernels are partially OK here too. The failure to flush properly appeared around 2.6.25 or so.

> 2) On single writes, there's no performance difference between the main
> three methods you might use, with the straight fsync method having a
> serious regression in this use case.

I'll ask again -- did you compile the test on RHEL6 for the RHEL6 tests? In later kernels, open_sync behavior depends on which kernel the binary was compiled against. As for fsync, it's not a regression; it is actually flushing metadata properly, and is therefore actually robust if the power fails during a write. Even the write-cache-disabled case on the Ubuntu kernel could leave the filesystem with corrupt data if the power failed during a metadata-intensive write.

>
> 3) WAL writes that are forced by wal_buffers filling will turn into a
> commit-length write when using the new, default open_datasync. Using
> the older default of fdatasync avoids that problem, in return for
> causing WAL writes to pollute the OS cache. The main benefit of O_DSYNC
> writes over fdatasync ones is avoiding the OS cache.
>
> I want to next go through and replicate some of the actual database
> level tests before giving a full opinion on whether this data proves
> it's worth changing the wal_sync_method detection. So far I'm torn
> between whether that's the right approach, or if we should just increase
> the default value for wal_buffers to something more reasonable.
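
(For scale, the current default wal_buffers is only 64kB -- eight 8kB pages -- so "more reasonable" presumably means something on the order of megabytes, e.g. wal_buffers = 16MB in postgresql.conf; that exact figure is illustrative, not from the thread.)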
>
> --
> Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
> PostgreSQL Training, Services and Support www.2ndQuadrant.us
> "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
