Re: silent data loss with ext4 / all current versions

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: silent data loss with ext4 / all current versions
Date: 2016-01-22 05:45:57
Message-ID: CAB7nPqQ8Am-b8Ttky=BqV9rbhnyVPAtaF2N6omXoiNvtAAKnGA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jan 19, 2016 at 4:20 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
>
> On 01/19/2016 08:03 AM, Michael Paquier wrote:
>>
>> On Tue, Jan 19, 2016 at 3:58 PM, Tomas Vondra
>> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>>
>>>
> ...
>>>>
>>>> Tomas, I am planning to have a look at that, because it seems to be
>>>> important. In case it becomes lost on my radar, do you mind if I add
>>>> it to the 2016-03 CF?
>>>
>>>
>>>
>>> Well, what else can I do? I have to admit I'm quite surprised this is
>>> still
>>> rotting here, considering it addresses a rather serious data loss /
>>> corruption issue on pretty common setup.
>>
>>
>> Well, I think you did what you could. And we need to be sure now that
>> it gets in and that this patch gets a serious lookup. So for now my
>> guess is that not loosing track of it would be a good first move. I
>> have added it here to attract more attention:
>> https://commitfest.postgresql.org/9/484/
>
>
> Ah, thanks. I haven't realized it's not added into 2016-1 (I'd swear I added
> it into the CF app).

So, I have been playing with a Linux VM with VMware Fusion and on ext4
with data=ordered the renames are getting lost if the root folder is
not fsync. By killing-9 the VM I am able to reproduce that really
easily.

Here are some comments about your patch after a look at the code.

Regarding the additions in fsync_fname() in xlog.c:
1) In InstallXLogFileSegment, rename() will be called only if
HAVE_WORKING_LINK is not used, which happens only on Windows and
cygwin. We could add it for consistency, but it should be within the
#else/#endif block. It is not critical as of now.
2) The call in RemoveXlogFile is not necessary, the rename happening
only on Windows.
3) In exitArchiveRecovery if the rename is not made durable I think it
does not matter much. Even if recovery.conf is the one present once
the node restarts node would move back again to recovery, and actually
we had better move back to recovery in this case, no?
4) StartupXLOG for the tablespace map. I don't think this one is
needed as well. Even if the tablespace map is not removed after a
power loss user would get an error telling that the file should be
removed.
5) For the one where haveBackupLabel || haveTblspcMap. If we do the
fsync, we guarantee that there is no need to do again the recovery.
But in case of a power loss, isn't it better to do the recovery again?
6) For the one after XLogArchiveNotify() for the last partial segment
of the old timeline, it doesn't really matter to not make the change
persistent as this is mainly done because it is useful to identify
that it is a partial segment.
7) In CancelBackup, this one is not needed as well I think. We would
surely want to get back to recovery if those files remain after a
power loss.

For the ones in xlogarchive.c:
1) For KeepFileRestoredFromArchive, it does not matter here, we are
renaming a file with a temporary name to a permanent name. Once the
node restarts, we would see an extra temporary file if the rename was
not effective.
2) XLogArchiveForceDone, the only bad thing that would happen here is
to leave behind a .ready file instead of a .done file. I guess that we
could have it though as an optimization to not have to archive again
this file.

For the one in pgarch.c:
1) In pgarch_archiveDone, we could add it as an optimization to
actually let the server know that the segment has been already
archived, preventing a retry.

In timeline.c:
1) writeTimeLineHistoryFile, it does not matter much. In the worst
case we would have just a dummy temporary file, and startup process
would come back here (in exitArchiveRecovery() we may finish with an
unnamed backup file similarly).
2) In writeTimeLineHistoryFile, similarly we don't need to care much,
in WalRcvFetchTimeLineHistoryFiles recovery would take again the same
path

Thoughts?
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Vitaly Burovoy 2016-01-22 06:03:07 Re: custom function for converting human readable sizes to bytes
Previous Message Noah Misch 2016-01-22 05:07:22 Re: Releasing in September