Re: fsync reliability

From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: fsync reliability
Date: 2011-04-22 03:51:49
Message-ID: 4DB0FB55.5010102@2ndQuadrant.com
Lists: pgsql-hackers

On 04/21/2011 04:26 AM, Simon Riggs wrote:
> However, that begs the question of what happens with WAL. At present,
> we do nothing to ensure that "the entry in the directory containing
> the file has also reached disk".
>

Well, we do, but it's not obvious why that is unless you've stared at
this for far too many hours. A clear description of the possible issue
you and Dan are raising showed up on LKML a few years ago:
http://lwn.net/Articles/270891/

Here's the most relevant part, which directly addresses the WAL case:

"[fsync] is unsafe for write-ahead logging, because it doesn't really
guarantee any _ordering_ for the writes at the hard storage level. So
aside from losing committed data, it can also corrupt structural
metadata. With ext3 it's quite easy to verify that fsync/fdatasync
don't always write a journal entry. (Apart from looking at the kernel
code :-)

Just write some data, fsync(), and observe the number of writes in
/proc/diskstats. If the current mtime second _hasn't_ changed, the
inode isn't written. If you write data, say, 10 times a second to the
same place followed by fsync(), you'll see a little more than 10 write
I/Os, and less than 20."
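
The experiment described in that quote can be sketched roughly as follows. This is only an illustration, not anything from the LKML post: the helper names are made up, and it assumes the usual Linux /proc/diskstats layout, where the device name is the third field and the "writes completed" counter is the eighth.

```python
import os


def writes_completed(diskstats_text, device):
    """Return the 'writes completed' counter for one device.

    Assumed layout of each /proc/diskstats line: major, minor, device
    name, reads completed, reads merged, sectors read, ms spent reading,
    writes completed, ...
    """
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) > 7 and fields[2] == device:
            return int(fields[7])
    raise KeyError(device)


def write_and_fsync(path, rounds, payload=b"x" * 512):
    """Overwrite the same spot `rounds` times, with fsync after each write."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for _ in range(rounds):
            os.pwrite(fd, payload, 0)  # always write at offset 0
            os.fsync(fd)
    finally:
        os.close(fd)
```

To reproduce the observation, read /proc/diskstats before and after a call to write_and_fsync(), pass both snapshots to writes_completed() with the name of the device backing the file (which you have to supply yourself), and compare the difference against the number of rounds. Other activity on the same device will inflate the count.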

There's a terrible hack suggested where you run fchmod to force the
journal out in the next fsync that makes me want to track the poster
down and shoot him, but this part raises a reasonable question.

The main issue he's complaining about here is a moot one for
PostgreSQL. If the WAL rewrites have been reordered but have not
completed, the minute WAL replay hits the spot with a missing block the
CRC32 will be busted and replay is finished. The fact that he's
assuming a database would have such a naive WAL implementation that it
would corrupt the database if blocks are written out of order in between
fsync calls returning is one of the reasons this whole idea never got
more traction--hard to get excited about a proposal whose fundamentals
rest on an assumption that doesn't turn out to be true on real databases.
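
To make the checksum argument concrete, here is a toy log format where every record carries a CRC32 and replay stops at the first record that fails it. This is a sketch of the general idea only. The record layout, HEADER, encode_record() and replay() are all invented for illustration and are not PostgreSQL's WAL format or code.

```python
import struct
import zlib

# Hypothetical record header: payload length, then CRC32 of the payload.
HEADER = struct.Struct("<II")


def encode_record(payload):
    """Prefix a payload with its length and CRC32."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def replay(log):
    """Return the payloads of all records up to the first bad one."""
    records = []
    offset = 0
    while offset + HEADER.size <= len(log):
        length, crc = HEADER.unpack_from(log, offset)
        start = offset + HEADER.size
        payload = log[start:start + length]
        # A zero length, a short read, or a CRC mismatch all mean the
        # record never made it to disk intact: treat it as end of log.
        if length == 0 or len(payload) < length or zlib.crc32(payload) != crc:
            break
        records.append(payload)
        offset = start + length
    return records
```

If a block in the middle of the log is missing or stale, replay() returns only the records before it, even when later records happen to be intact. Writes that landed out of order past the damaged spot are simply never applied.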

There's still the "fsync'd a data block but not the directory entry yet"
issue as fall-out from this too. Why doesn't PostgreSQL run into this
problem? Because the exact code sequence used is this one:

open
write
fsync
close

And Linux shouldn't ever screw that up, or the similar rename path.
Here's what the close man page says, from
http://linux.die.net/man/2/close :

"A successful close does not guarantee that the data has been
successfully saved to disk, as the kernel defers writes. It is not
common for a filesystem to flush the buffers when the stream is closed.
If you need to be sure that the data is physically stored use fsync(2).
(It will depend on the disk hardware at this point.)"

What this is alluding to is that if you fsync before closing, the close
will write all the metadata out too. You're busted if your write cache
lies, but we already know all about that issue.
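
For reference, the open/write/fsync/close sequence and the similar rename path look roughly like this when spelled out. It is a minimal sketch with made-up helper names (write_durably, replace_via_rename) and a made-up ".tmp" suffix, not the PostgreSQL source. It deliberately does not fsync the containing directory, since whether that is needed is exactly what is being argued about here.

```python
import os


def write_durably(path, data):
    """open, write, fsync, close, in that order."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        while len(view) > 0:
            written = os.write(fd, view)  # write() may be partial
            view = view[written:]
        os.fsync(fd)  # flush file data before the descriptor is closed
    finally:
        os.close(fd)


def replace_via_rename(path, data):
    """Write a temporary file durably, then rename it over `path`."""
    tmp = path + ".tmp"  # hypothetical naming convention
    write_durably(tmp, data)
    os.replace(tmp, path)  # rename(2) on POSIX systems
```

The important detail is the ordering: fsync runs on the still-open descriptor, and close (or rename) only happens after fsync has returned successfully.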

There was a discussion of issues around this on LKML a few years ago,
with Alan Cox getting the good pull quote at
http://lkml.org/lkml/2009/3/27/268 : "fsync/close() as a pair allows the
user to correctly indicate their requirements." While fsync doesn't
guarantee that metadata is written out, and neither does close, kernel
developers seem to all agree that fsync-before-close means you want
everything on disk. Filesystems that don't honor that will break all
sorts of software.

It is of course possible there are bugs in some part of this code path,
where a clever enough test case might expose a window of strange
file/metadata ordering. I think it's too weak of a theorized problem to
go specifically chasing after though.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
