Re: Raid 10 chunksize

From: Scott Carey <scott(at)richrelevance(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: Stef Telford <stef(at)ummon(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Raid 10 chunksize
Date: 2009-04-02 20:34:13
Message-ID: C5FA6F55.41EF%scott@richrelevance.com
Lists: pgsql-performance


On 4/2/09 1:53 AM, "Greg Smith" <gsmith(at)gregsmith(dot)com> wrote:

> On Wed, 1 Apr 2009, Scott Carey wrote:
>
>> Write caching on SATA is totally fine. There were some old ATA drives that
>> when paired with some file systems or OS's would not be safe. There are
>> some combinations that have unsafe write barriers. But there is a standard
>> well supported ATA command to sync and only return after the data is on
>> disk. If you are running an OS that is anything recent at all, and any
>> disks that are not really old, you're fine.
>
> While I would like to believe this, I don't trust any claims in this area
> that don't have matching tests that demonstrate things working as
> expected. And I've never seen this work.
>
> My laptop has a 7200 RPM drive, which means that if fsync is being passed
> through to the disk correctly I can only fsync <120 times/second. Here's
> what I get when I run sysbench on it, starting with the default ext3
> configuration:
>
> $ uname -a
> Linux gsmith-t500 2.6.28-11-generic #38-Ubuntu SMP Fri Mar 27 09:00:52 UTC
> 2009 i686 GNU/Linux
>
> $ mount
> /dev/sda3 on / type ext3 (rw,relatime,errors=remount-ro)
>
> $ sudo hdparm -I /dev/sda | grep FLUSH
> * Mandatory FLUSH_CACHE
> * FLUSH_CACHE_EXT
>
> $ ~/sysbench-0.4.8/sysbench/sysbench --test=fileio --file-fsync-freq=1
> --file-num=1 --file-total-size=16384 --file-test-mode=rndwr run
> sysbench v0.4.8: multi-threaded system evaluation benchmark
>
> Running the test with following options:
> Number of threads: 1
>
> Extra file open flags: 0
> 1 files, 16Kb each
> 16Kb total file size
> Block size 16Kb
> Number of random requests for random IO: 10000
> Read/Write ratio for combined random IO test: 1.50
> Periodic FSYNC enabled, calling fsync() each 1 requests.
> Calling fsync() at the end of test, Enabled.
> Using synchronous I/O mode
> Doing random write test
> Threads started!
> Done.
>
> Operations performed: 0 Read, 10000 Write, 10000 Other = 20000 Total
> Read 0b Written 156.25Mb Total transferred 156.25Mb (39.176Mb/sec)
> 2507.29 Requests/sec executed
>
>
> OK, that's clearly cached writes where the drive is lying about fsync.
> The claim is that since my drive supports both the flush calls, I just
> need to turn on barrier support, right?
>
> [Edit /etc/fstab to remount with barriers]
>
> $ mount
> /dev/sda3 on / type ext3 (rw,relatime,errors=remount-ro,barrier=1)
>
> [sysbench again]
>
> 2612.74 Requests/sec executed
>
> -----
>
> This is basically how this always works for me: somebody claims barriers
> and/or SATA disks work now, no really this time. I test, they give
> answers that aren't possible if fsync were working properly, I conclude
> turning off the write cache is just as necessary as it always was. If you
> can suggest something wrong with how I'm testing here, I'd love to hear
> about it. I'd like to believe you but I can't seem to produce any
> evidence that supports your claims here.

Your data looks good, and it casts a lot of doubt on my previous sources of
info.
So I did more research. It seems that (most) drives don't lie; your OS and
file system do (or sometimes the device driver or RAID card). I know LVM, MD,
and other Linux block-remapping layers break write barriers as well.
Apparently ext3 doesn't implement fsync with a write barrier or cache flush.
Linux kernel mailing lists implied that 2.6 had fixed these, but apparently
not: write barriers were fixed, but fsync was not. Even more confusing, the
behavior in some highly patched and backported Linux distributions (SUSE and
RedHat, mostly) may differ from distributions closer to the kernel trunk,
like Ubuntu.

If you can, try xfs with write barriers on. I'll try some tests using FIO
(I'm not familiar with sysbench, but it looks easy too) with various file
systems and some SATA and SAS/SCSI setups when I get a chance.
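
For anyone who wants a quick check of their own hardware without sysbench or
FIO, below is a bare-bones sketch in plain C (just an illustration; it
assumes a scratch file named "testfile" on the filesystem under test). The
reasoning is the same as in your numbers above: a 7200 RPM disk rotates 120
times per second, so a working fsync() can't complete much more than ~120
times/second -- rates in the thousands mean the drive cache isn't really
being flushed.

/* fsync-rate sketch: write a 16KB block and fsync() it, repeatedly. */
#define _XOPEN_SOURCE 500
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
    const int iterations = 1000;
    char buf[16384];
    int fd, i;
    struct timeval start, end;
    double secs;

    memset(buf, 'x', sizeof(buf));

    fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    gettimeofday(&start, NULL);
    for (i = 0; i < iterations; i++) {
        if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t) sizeof(buf)) {
            perror("pwrite"); return 1;
        }
        if (fsync(fd) != 0) { perror("fsync"); return 1; }
    }
    gettimeofday(&end, NULL);

    secs = (end.tv_sec - start.tv_sec) +
           (end.tv_usec - start.tv_usec) / 1e6;
    printf("%d fsyncs in %.2f sec = %.0f fsyncs/sec\n",
           iterations, secs, iterations / secs);
    close(fd);
    return 0;
}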

A lot of my prior evidence came from the Linux kernel list and other places
where I had trusted the info over the years. I'll dig up more. But here is
what I've learned in the past, plus a bit from today:
Drives don't lie anymore, and write barriers and the lower-level ATA sync
commands just work. Linux fixed write barrier support in kernel 2.5.
Several OSes do the right thing with respect to fsync and many don't. I had
thought Linux fixed this too, but it turns out only write barriers were fixed
and fsync was left broken:
http://kerneltrap.org/mailarchive/linux-kernel/2008/2/26/987024/thread

In your tests the barriers slowed things down a lot, so something is working
right there. From what I can see, with ext3, metadata changes cause much
more frequent write barrier activity, so 'relatime' and 'noatime' actually
HURT your data integrity as a side effect of fsync not guaranteeing what you
think it does.

The big one, is this quote from the linux kernel list:
" Right now, if you want a reliable database on Linux, you _cannot_
properly depend on fsync() or fdatasync(). Considering how much Linux
is used for critical databases, using these functions, this amazes me.
"

Check this full post out that started that thread:
http://kerneltrap.org/mailarchive/linux-kernel/2008/2/26/987024

I admit that it looks like I'm pretty wrong, for Linux with ext3 at the
least.
Linux is often not safe with disk write caches because its fsync() call
doesn't flush the cache. The root problem is not the drives, it's Linux /
ext3. Its write-barrier support is fine now (if you don't go through LVM or
MD, which don't support it), but fsync does not guarantee anything other than
the write having left the OS and gone to the device. In fact, POSIX fsync(2)
doesn't require that the data is on disk. Interestingly, Postgres would be
safer on Linux if it used sync_file_range instead of fsync(), but that has
other drawbacks and limitations -- and it is broken by use of LVM or MD.
Currently, with Linux + ext3 + Postgres, you are only guaranteed when fsync()
returns that the data has left the OS, not that it is on a drive -- SATA or
SAS. Strangely, sync_file_range() is safer than fsync() in the presence of
any drive cache at all (including a battery-backed RAID card that fails)
because it at least enforces write barriers.
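
For reference, here is roughly what the sync_file_range() call looks like
(Linux-specific, kernel 2.6.17 or later). This is only a sketch of the
interface as documented in the man page -- flush_range() is a made-up helper
name, and whether the call really gives stronger guarantees than fsync() on a
particular kernel and file system would still have to be verified by testing:

/* Write back a file range and wait for it (this alone does not prove the
 * data reached the platters -- that depends on the kernel and device). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

int flush_range(int fd, off64_t offset, off64_t nbytes)
{
    /* Wait for in-flight writeback of the range, start writeback of any
     * dirty pages in it, then wait for that writeback to complete. */
    unsigned int flags = SYNC_FILE_RANGE_WAIT_BEFORE |
                         SYNC_FILE_RANGE_WRITE |
                         SYNC_FILE_RANGE_WAIT_AFTER;
    if (sync_file_range(fd, offset, nbytes, flags) != 0) {
        perror("sync_file_range");
        return -1;
    }
    return 0;
}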

Fsync + SATA write cache is safe on Solaris with ZFS, but not on Solaris with
UFS (the file system is write-barrier and cache aware in the former case but
not the latter).

Linux (a lot) and Postgres (a little) can learn from some of the ZFS
concepts with regard to atomicity of changes and checksums on data and
metadata. Many of the above issues would simply not exist with good checksum
use. Ext4 has journal segment checksums, but no metadata or data checksums
that would allow detecting partial writes to anything but the journal.
Postgres is adding checksums on data, and is already essentially
copy-on-write for MVCC, which is awesome -- are xlog writes protected by
checksums? Accidental out-of-order writes become an issue that can be dealt
with in a log or journal that has checksums, even on an OS and file system
that don't have good guarantees for fsync, like Linux + ext3. Postgres could
make itself safe even if the drive write cache is enabled, fsync lies, AND
there is a power failure. If I'm not mistaken, block checksums on data +
xlog entry checksums can make undetected corruption very difficult even if
fsync is off (though data writes happening before xlog writes are still bad
-- fixing that would require external-to-block checksums, like ZFS has)!
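
To illustrate the checksum idea (this is NOT the actual Postgres xlog record
format, just a hypothetical layout): a checksum stored alongside each log
record lets recovery detect a torn or partial write and stop replaying at
that point, no matter what fsync did or did not guarantee:

/* Hypothetical log record with a payload checksum. On recovery, a record
 * whose stored CRC doesn't match the recomputed one is treated as the end
 * of the valid log (it was torn or never fully written). */
#include <stddef.h>
#include <stdint.h>

static uint32_t crc32_simple(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;
    size_t i;
    int b;
    for (i = 0; i < len; i++) {
        crc ^= p[i];
        for (b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

struct log_record {
    uint32_t length;          /* payload length in bytes */
    uint32_t crc;             /* checksum of the payload */
    char     payload[];       /* record data follows */
};

/* Returns 1 if the record survived intact, 0 if it is torn/corrupted. */
int record_is_valid(const struct log_record *rec)
{
    return crc32_simple(rec->payload, rec->length) == rec->crc;
}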

http://lkml.org/lkml/2005/5/15/85

Where the "disks lie to you" stuff probably came from:
http://hardware.slashdot.org/article.pl?sid=05/05/13/0529252&tid=198&tid=128
(turns out it's the OS that isn't flushing the cache on fsync).

http://xfs.org/index.php/XFS_FAQ#Q:_What_is_the_problem_with_the_write_cache_on_journaled_filesystems.3F
So if xfs fsync has a barrier, it's safe with either:
a raw device that respects cache flush + write caching on,
OR
a battery-backed RAID card + drive write caching off.

XFS fsync supposedly works right (need to test), but fdatasync() does not.

What this really boils down to is that POSIX fsync does not provide a
guarantee that the data is on disk at all. My previous comments are wrong.
This means that fsync protects you from OS crashes, but not power failure.
It can do better in some systems / implementations.

>
> --
> * Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD
>
