Re: silent data loss with ext4 / all current versions

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: silent data loss with ext4 / all current versions
Date: 2015-11-27 18:01:09
Message-ID: 56589A65.4060201@2ndquadrant.com
Lists: pgsql-hackers

On 11/27/2015 02:18 PM, Michael Paquier wrote:
> On Fri, Nov 27, 2015 at 8:17 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> So, what's going on? The problem is that while the rename() is atomic, it's
>> not guaranteed to be durable without an explicit fsync on the parent
>> directory. And by default we only do fdatasync on the recycled segments,
>> which may not force fsync on the directory (and ext4 does not do that,
>> apparently).
>
> Yeah, that seems to be how the POSIX spec settles things.
> "If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall
> force all currently queued I/O operations associated with the file
> indicated by file descriptor fildes to the synchronized I/O completion
> state. All I/O operations shall be completed as defined for
> synchronized I/O file integrity completion."
> http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html
> If I understand that right, it is guaranteed that the rename() will be
> atomic, meaning that there will be only one file even if there is a
> crash, but that we need to fsync() the parent directory as mentioned.
>
>> FWIW this has nothing to do with storage reliability - you may have good
>> drives, RAID controller with BBU, reliable SSDs or whatever, and you're
>> still not safe. This issue is at the filesystem level, not storage.
>
> The POSIX spec authorizes this behavior, so the FS is not to blame,
> clearly. At least that's what I get from it.

The spec seems a bit vague to me (though maybe it's not; I'm not a POSIX
expert), but I think we should be prepared for the less favorable
interpretation.
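
For the archives, the pattern we're talking about looks roughly like this
in plain POSIX C (a sketch with a made-up function name and trimmed error
handling, not a patch):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Sketch only: rename a file and then fsync the parent directory so the
 * rename survives a crash.
 */
static int
rename_durable(const char *oldpath, const char *newpath, const char *dirpath)
{
    int     dirfd;

    /* the rename itself is atomic ... */
    if (rename(oldpath, newpath) < 0)
        return -1;

    /* ... but not durable until the directory entry is flushed */
    dirfd = open(dirpath, O_RDONLY);
    if (dirfd < 0)
        return -1;

    if (fsync(dirfd) < 0)
    {
        close(dirfd);
        return -1;
    }

    return close(dirfd);
}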

>
>> I think this issue might also result in various other issues, not just data
>> loss. For example, I wouldn't be surprised by data corruption due to
>> flushing some of the changes in data files to disk (due to contention for
>> shared buffers and reaching vm.dirty_bytes) and then losing the matching WAL
>> segment. Also, while I have only seen 1 to 3 segments getting lost, it might
>> be possible that more segments can get lost, possibly making the recovery
>> impossible. And of course, this might cause problems with WAL archiving due
>> to archiving the same segment twice (before and after crash).
>
> Possible; the switch to .done happens after renaming the segment in
> xlogarchive.c, so this could happen in theory.

Yes. That's one of the suspicious places in my notes (I haven't posted
all the details; the message was long enough already).

>> Attached is a proposed fix for this (xlog-fsync.patch), and I'm pretty sure
>> this needs to be backpatched to all backbranches. I've also attached a patch
>> that adds pg_current_xlog_flush_location() function, which proved to be
>> quite useful when debugging this issue.
>
> Agreed. We should be sure as well that the calls to fsync_fname get
> issued in a critical section with START/END_CRIT_SECTION(). It does
> not seem to be the case with your patch.

Don't know. I've based that on code from replication/logical/, which calls
fsync_fname() in all the interesting places without a critical section.
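
Just to illustrate what you're suggesting (a sketch only, not the actual
patch; it assumes fsync_fname() from storage/fd.h, XLOGDIR from
xlog_internal.h and the crit-section macros from miscadmin.h):

#include "postgres.h"

#include "access/xlog_internal.h"   /* XLOGDIR */
#include "miscadmin.h"              /* START_CRIT_SECTION / END_CRIT_SECTION */
#include "storage/fd.h"             /* fsync_fname */

/*
 * Hypothetical helper (not the actual patch): flush the pg_xlog directory
 * entry after a rename. Wrapping it in a critical section, as suggested
 * above, means any error raised inside is escalated to PANIC.
 */
static void
XLogFsyncDir(void)
{
    START_CRIT_SECTION();
    fsync_fname(XLOGDIR, true);
    END_CRIT_SECTION();
}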

regards

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
