Re: silent data loss with ext4 / all current versions

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: silent data loss with ext4 / all current versions
Date: 2015-11-29 12:59:22
Message-ID: CAB7nPqRsyUDXizRRD+rJ4=FBu4MwBZtaksc2X+bbL0vs1a6-5Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Nov 28, 2015 at 3:01 AM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
>
> On 11/27/2015 02:18 PM, Michael Paquier wrote:
>>
>> On Fri, Nov 27, 2015 at 8:17 PM, Tomas Vondra
>> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>>
>>> So, what's going on? The problem is that while the rename() is atomic,
>>> it's
>>> not guaranteed to be durable without an explicit fsync on the parent
>>> directory. And by default we only do fdatasync on the recycled segments,
>>> which may not force fsync on the directory (and ext4 does not do that,
>>> apparently).
>>
>>
>> Yeah, that seems to be the way the POSIX spec clears things.
>> "If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall
>> force all currently queued I/O operations associated with the file
>> indicated by file descriptor fildes to the synchronized I/O completion
>> state. All I/O operations shall be completed as defined for
>> synchronized I/O file integrity completion."
>> http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html
>> If I understand that right, it is guaranteed that the rename() will be
>> atomic, meaning that there will be only one file even if there is a
>> crash, but that we need to fsync() the parent directory as mentioned.
>>
>>> FWIW this has nothing to do with storage reliability - you may have good
>>> drives, RAID controller with BBU, reliable SSDs or whatever, and you're
>>> still not safe. This issue is at the filesystem level, not storage.
>>
>>
>> The POSIX spec authorizes this behavior, so the FS is not to blame,
>> clearly. At least that's what I get from it.
>
>
> The spec seems a bit vague to me (but maybe it's not, I'm not a POSIX
> expert),

As I am understanding it, FS implementations are free to decide to
make the rename persist on disk or not.

> but we should be prepared for the less favorable interpretation I
> think.

Yep. I agree. And in case my previous words were not clear, that's the
same line of thought here, we had better cover our backs and study
carefully each code path that could be impacted.

>>> Attached is a proposed fix for this (xlog-fsync.patch), and I'm pretty
>>> sure
>>> this needs to be backpatched to all backbranches. I've also attached a
>>> patch
>>> that adds pg_current_xlog_flush_location() function, which proved to be
>>> quite useful when debugging this issue.
>>
>>
>> Agreed. We should be sure as well that the calls to fsync_fname get
>> issued in a critical section with START/END_CRIT_SECTION(). It does
>> not seem to be the case with your patch.
>
>
> Don't know. I've based that on code from replication/logical/ which does
> fsync_fname() on all the interesting places, without the critical section.

For slot information in slot.c, there will be a PANIC when fsyncing
pg_replslot at some points. It does not seem that weird to do the same
for example after renaming the backup label file..
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2015-11-29 13:12:29 Re: proposal: PL/Pythonu - function ereport
Previous Message Michael Paquier 2015-11-29 12:42:15 Re: proposal: PL/Pythonu - function ereport