Re: fsync reliability

From: Daniel Farina <daniel(at)heroku(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: fsync reliability
Date: 2011-04-25 02:06:06
Message-ID: BANLkTinr8+ntSmRZMKZKMFMqiCbX_tqBhg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Apr 21, 2011 at 8:51 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> There's still the "fsync'd a data block but not the directory entry yet"
> issue as fall-out from this too.  Why doesn't PostgreSQL run into this
> problem?  Because the exact code sequence used is this one:
>
> open
> write
> fsync
> close
>
> And Linux shouldn't ever screw that up, or the similar rename path.  Here's
> what the close man page says, from http://linux.die.net/man/2/close :

Theodore Ts'o addresses this *exact* sequence of events, and suggests
that if you want that rename to definitely stick, you must fsync the
directory:

http://www.linuxfoundation.org/news-media/blogs/browse/2009/03/don%E2%80%99t-fear-fsync

"""
One argument that has commonly been made on the various comment
streams is that when replacing a file by writing a new file and then
renaming “file.new” to “file”, most applications don’t need a
guarantee that new contents of the file are committed to stable store
at a certain point in time; only that either the new or the old
contents of the file will be present on the disk. So the argument is
essentially that the sequence:

fd = open("foo.new", O_WRONLY);
write(fd, buf, bufsize);
fsync(fd);
close(fd);
rename("foo.new", "foo");
… is too expensive, since it provides “atomicity and durability”, when
in fact all the application needed was “atomicity” (i.e., either the
new or the old contents of foo should be present after a crash), but
not durability (i.e., the application doesn’t need the new
version of foo now, but rather at some intermediate time in the future
when it’s convenient for the OS).

This argument is flawed for two reasons. First of all, the sequence
above provides exactly the desired “atomicity without durability”. It
doesn’t guarantee which version of the file will appear in the event
of an unexpected crash; if the application needs a guarantee that the
new version of the file will be present after a crash, ***it’s
necessary to fsync the containing directory***
"""

Emphasis mine.
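
For concreteness, here's a minimal sketch (mine, not from the blog
post) of the full pattern on Linux: write and fsync the new file,
rename it into place, then open and fsync the containing directory so
the rename itself survives a crash. File names and error handling are
only illustrative.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    const char *buf = "new contents\n";
    int         fd;
    int         dirfd;

    /* Write the new version under a temporary name and flush its data. */
    fd = open("foo.new", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0 ||
        write(fd, buf, strlen(buf)) < 0 ||
        fsync(fd) < 0 ||
        close(fd) < 0)
    {
        perror("writing foo.new");
        exit(1);
    }

    /* Atomically swap the new version into place. */
    if (rename("foo.new", "foo") < 0)
    {
        perror("rename");
        exit(1);
    }

    /*
     * The rename only changed the directory, so to guarantee the new
     * name is present after a crash, fsync the containing directory
     * as well.
     */
    dirfd = open(".", O_RDONLY);
    if (dirfd < 0 || fsync(dirfd) < 0 || close(dirfd) < 0)
    {
        perror("fsync of containing directory");
        exit(1);
    }
    return 0;
}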

So, all in all, I think the creation, deletion, and renaming of files
in the write-ahead log area should be followed by an fsync of pg_xlog.
I think it is also necessary to fsync directories in the cluster
directory at checkpoint time: if a chunk of directory metadata doesn't
make it to disk, a checkpoint occurs, and then there's a crash, it's
possible that replaying the WAL post-checkpoint won't
create/move/delete the file in the cluster.
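
Roughly what I have in mind is a small helper along these lines: a
hypothetical fsync_dir(), sketched here for illustration rather than
taken from the tree, that could be called on pg_xlog right after a
segment is created, renamed, or removed, and on the relevant
directories at checkpoint time.

#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper, illustration only: open a directory and fsync
 * it so that recent entry changes (creations, renames, unlinks) reach
 * stable storage.
 */
static int
fsync_dir(const char *path)
{
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fsync(fd) < 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}

For example, fsync_dir("pg_xlog") after renaming a segment into place,
and the same over the database directories as part of a checkpoint.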

The fact that this hasn't been happening (or hasn't triggered an
error, which would be scarier) may just be a happy accident of that
data being flushed most of the time, which also means that the fsync()
on the directory file descriptor won't cost very much anyway.

--
fdr
