Fwd: Apple Darwin disabled fsync?

From: Peter Bierman <bierman(at)apple(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Fwd: Apple Darwin disabled fsync?
Date: 2005-02-20 02:43:14
Message-ID: a06010200be3da9564694@[17.202.21.231]
Lists: pgsql-hackers

>Date: Sat, 19 Feb 2005 17:59:21 -0800
>From: Dominic Giampaolo <dbg(at)apple(dot)com>
>Subject: Re: bad fsync? (A.M.)
>To: darwin-dev(at)lists(dot)apple(dot)com
>
>>MySQL makes the following claim at:
>>http://dev.mysql.com/doc/mysql/en/news-4-1-9.html
>>
>>"InnoDB: Use the fcntl() file flush method on Mac OS X versions 10.3
>>and up. Apple had disabled fsync() in Mac OS X for internal disk
>>drives, which caused corruption at power outages."
>>
>>First of all, is this accurate? A pointer to some docs or a tech note
>>on this would be helpful.
>>
>The comments about fsync() are wrong...
>
>On MacOS X, fsync() always has flushed, and always will flush, all file data
>from host memory to the drive on which the file resides. The behavior
>of fsync() on MacOS X is the same as it is on every other version of
>Unix since the dawn of time (well, since the introduction of fsync
>anyway :-).
>
>I believe that what the above comment refers to is the fact that
>fsync() is not sufficient to guarantee that your data is on stable
>storage, and so on MacOS X we provide an fcntl() command, F_FULLFSYNC,
>to ask the drive to flush all buffered data to stable storage.
>
>Let me explain in more detail. With fsync(), even though the OS
>writes the data through to the disk and the disk says "yes I wrote
>the data", the data is not actually on permanent storage. Unless
>you explicitly disable it, all disks have a write buffer which holds
>data you've written. The disk buffers the data you wrote until it
>decides to flush it to the platters (and the writes may not be in
>the order you wrote them). If you lose power or the system crashes
>before the data is written, you can wind up in a situation where only
>some of your data is actually on disk. What is worse is that even if
>you write blocks A, B and C, call fsync() and then write block D you
>may find after rebooting that blocks A and D are on disk but B and C
>are not (in fact any ordering of A, B, C, and D is possible).
>
>While this may seem like a rare case, it is not. In fact, if you sit
>down and pull the plug on a system you can make it happen in one or
>two plug pulls. I have even gone so far as to watch this behavior
>with a logic analyzer on the ATA bus: I saw the data for two writes
>come across the ATA cable, the drive replied and said the writes were
>successful and then when we rebooted the data from the second write
>was correct on disk but the data from the first write was not.
>
>To deal with this we introduced the F_FULLFSYNC fcntl which will ask
>the drive to flush all of its buffered data to disk. When an app
>needs to guarantee that data is on disk it should use F_FULLFSYNC.
>In most cases you do not need such a heavy-handed operation and
>fsync() is good enough. But in an app like a database, it is
>essential if you want transactional integrity.
>
>Now, a little bit more detail: on ATA drives we implement F_FULLFSYNC
>with the FLUSH_TRACK_CACHE command. All drives sold by Apple will
>honor this command. Unfortunately, quite a few FireWire drive vendors
>disable this command and do not pass it to the drive. This means that
>most external FireWire drives are not reliable if you lose power or
>the system crashes. We can't work around that unless we ask the drive
>to disable the write cache completely (which hurts performance quite
>badly -- and even that may not be enough as some drives will ignore
>that request too).
>
>So in summary, I believe that the comments in the MySQL news posting
>are slightly confused. On MacOS X fsync() behaves the same as it does
>on all Unices. That's not good enough if you really care about data
>integrity and so we also provide the F_FULLFSYNC fcntl. As far as I
>know, MacOS X is the only OS to provide this feature for apps that
>need to truly guarantee their data is on disk.
>
>Hope this clears things up.
>
>--dominic
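
For readers who want to see what the advice in the quoted message might look like in practice, here is a minimal C sketch. It is not taken from the message or from any Apple or PostgreSQL source. The helper names durable_sync() and write_durably() are hypothetical. The sketch assumes that F_FULLFSYNC is visible through <fcntl.h> on Mac OS X, and that falling back to plain fsync() is acceptable when the fcntl is unavailable or fails; an application that needs strict guarantees might prefer to treat that failure as an error instead.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: push file data as far toward stable storage as
 * the platform allows. Returns 0 on success, -1 on error. */
static int durable_sync(int fd)
{
    assert(fd >= 0);
#ifdef F_FULLFSYNC
    /* Mac OS X: ask the drive itself to flush its write buffer. */
    if (fcntl(fd, F_FULLFSYNC) == 0)
        return 0;
    /* Assumption: if the request is not supported for this file,
     * fall through to fsync() rather than failing outright. */
#endif
    /* Elsewhere: fsync() hands the data to the drive, which may
     * still hold it in its own cache. */
    return fsync(fd);
}

/* Hypothetical helper: write a whole buffer to path, then sync it.
 * Returns 0 on success, -1 on error. */
static int write_durably(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before writing; retry */
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    if (durable_sync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

A database would call something like durable_sync() at commit points, for example after writing a log record and before reporting the transaction as committed. As the message notes, the call is expensive, so it is a poor fit for routine file writes.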
