Re: silent data loss with ext4 / all current versions

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: silent data loss with ext4 / all current versions
Date: 2015-11-27 13:18:23
Message-ID: CAB7nPqQvVg7X4Nh+B95AqYSgiub8r=Ce8Q7oRPkgEphCkpPZ=A@mail.gmail.com
Lists: pgsql-hackers

On Fri, Nov 27, 2015 at 8:17 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> So, what's going on? The problem is that while the rename() is atomic, it's
> not guaranteed to be durable without an explicit fsync on the parent
> directory. And by default we only do fdatasync on the recycled segments,
> which may not force fsync on the directory (and ext4 does not do that,
> apparently).

Yeah, that seems to be how the POSIX spec settles it:
"If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall
force all currently queued I/O operations associated with the file
indicated by file descriptor fildes to the synchronized I/O completion
state. All I/O operations shall be completed as defined for
synchronized I/O file integrity completion."
http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html
If I understand that correctly, the rename() is guaranteed to be
atomic, meaning that there will be only one file even if there is a
crash, but we still need to fsync() the parent directory, as you
mention.
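
For what it's worth, here is a minimal standalone sketch of the pattern
in question (the file names are made up); the fsync() on the directory
descriptor is the step that is missing today:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        int     dirfd;

        /* rename() itself is atomic: we end up with one name or the other */
        if (rename("pg_xlog/xlogtemp.1234",
                   "pg_xlog/000000010000000000000002") < 0)
            perror("rename");

        /* ... but it only becomes durable once the parent directory is flushed */
        dirfd = open("pg_xlog", O_RDONLY);
        if (dirfd < 0 || fsync(dirfd) < 0)
            perror("fsync pg_xlog");
        if (dirfd >= 0)
            close(dirfd);

        return 0;
    }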

> FWIW this has nothing to do with storage reliability - you may have good
> drives, RAID controller with BBU, reliable SSDs or whatever, and you're
> still not safe. This issue is at the filesystem level, not storage.

The POSIX spec authorizes this behavior, so the FS is clearly not to
blame. At least that's what I get from it.

> I've done the same tests on xfs and that seems to be safe - I've been unable
> to reproduce the issue, so either the issue is not there or it's more
> difficult to hit it. I haven't tried on other file systems, because ext4 and
> xfs cover the vast majority of deployments (at least on Linux), so I believe
> an issue on ext4 is serious enough.
>
> So pretty much everyone running on Linux + ext3/ext4 is vulnerable.
>
> It's also worth mentioning that the data is not actually lost - it's
> properly fsynced in the WAL segments, it's just the rename that got lost. So
> it's possible to survive this without losing data by manually renaming the
> segments, but this must happen before starting the cluster because the
> automatic recovery comes and discards all the data etc.

Hm. Most users are not going to notice that, particularly in embedded
setups.

> I think this issue might also result in various other issues, not just data
> loss. For example, I wouldn't be surprised by data corruption due to
> flushing some of the changes in data files to disk (due to contention for
> shared buffers and reaching vm.dirty_bytes) and then losing the matching WAL
> segment. Also, while I have only seen 1 to 3 segments getting lost, it might
> be possible that more segments can get lost, possibly making the recovery
> impossible. And of course, this might cause problems with WAL archiving
> due to archiving the same segment twice (before and after crash).

Possibly; the switch to .done happens after renaming the segment in
xlogarchive.c, so this could happen in theory.

> Attached is a proposed fix for this (xlog-fsync.patch), and I'm pretty sure
> this needs to be backpatched to all backbranches. I've also attached a patch
> that adds pg_current_xlog_flush_location() function, which proved to be
> quite useful when debugging this issue.

Agreed. We should also make sure that the calls to fsync_fname are
issued inside a critical section, with START/END_CRIT_SECTION(); that
does not seem to be the case with your patch.
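
Roughly what I have in mind (illustrative only, not taken from your
patch): in InstallXLogFileSegment(), once the segment has been renamed
into place, do something like

    START_CRIT_SECTION();
    fsync_fname(XLOGDIR, true);     /* flush the pg_xlog directory entry */
    END_CRIT_SECTION();

so that an error from the directory fsync escalates to a PANIC instead
of leaving behind a rename that may not survive a crash.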

> And finally, I've done a quick review of all places that might suffer the
> same issue - some are not really interesting as the stuff is ephemeral
> anyway (like pgstat for example), but there are ~15 places that may need
> this fix:
>
> * src/backend/access/transam/timeline.c (2 matches)
> * src/backend/access/transam/xlog.c (9 matches)
> * src/backend/access/transam/xlogarchive.c (3 matches)
> * src/backend/postmaster/pgarch.c (1 match)
>
> Some of these places might actually be safe because an fsync happens
> somewhere immediately after the rename (e.g. in a caller), but I guess
> better safe than sorry.

I had a quick look at those code paths, and indeed it would be safer to
make sure that those fsync calls are issued once rename() has been
called.
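
To make that less error-prone across all of those call sites, a small
shared helper could wrap the sequence; a rough sketch (the name and
exact shape are mine, just for illustration):

    /*
     * Rename oldname to newname and make the result survive a crash,
     * dirname being the directory that contains both entries.
     */
    static void
    rename_durable(const char *oldname, const char *newname,
                   const char *dirname)
    {
        /* flush the file contents before they become visible under newname */
        fsync_fname(oldname, false);

        if (rename(oldname, newname) < 0)
            ereport(PANIC,
                    (errcode_for_file_access(),
                     errmsg("could not rename file \"%s\" to \"%s\": %m",
                            oldname, newname)));

        /* flush the directory entry so the rename itself is durable */
        fsync_fname(dirname, true);
    }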

> I plan to do more power failure testing soon, with more complex test
> scenarios. I suspect there might be other similar issues (e.g. when we
> rename a file before a checkpoint and don't fsync the directory - then the
> rename won't be replayed and will be lost).

That would be great.
--
Michael
