Re: silent data loss with ext4 / all current versions

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: silent data loss with ext4 / all current versions
Date: 2016-01-22 12:32:29
Message-ID: CAB7nPqQ_y+hymFfe6z9sfHHuajhB8Y1oQSEf5xW-aseBEaoN2Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Jan 22, 2016 at 5:26 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> On 01/22/2016 06:45 AM, Michael Paquier wrote:
>> Here are some comments about your patch after a look at the code.
>>
>> Regarding the additions in fsync_fname() in xlog.c:
>> 1) In InstallXLogFileSegment, rename() will be called only if
>> HAVE_WORKING_LINK is not used, which happens only on Windows and
>> cygwin. We could add it for consistency, but it should be within the
>> #else/#endif block. It is not critical as of now.
>> 2) The call in RemoveXlogFile is not necessary, the rename happening
>> only on Windows.
>
> Hmmm, OK. Are we sure HAVE_WORKING_LINK is false only on Windows, or could
> there be some other platforms? And are we sure the file systems on those
> platforms are safe without the fsync call?
> That is, while the report references ext4, there may be other file systems
> with the same problem - ext4 was used mostly as it's the most widely used
> Linux file system.

From pg_config_manual.h:
#if !defined(WIN32) && !defined(__CYGWIN__)
#define HAVE_WORKING_LINK 1
#endif
If we want to be consistent with what Posix proposes, I am not against
adding it.

>> 3) In exitArchiveRecovery if the rename is not made durable I think
>> it does not matter much. Even if recovery.conf is the one present
>> once the node restarts node would move back again to recovery, and
>> actually we had better move back to recovery in this case, no?
>
> I'm strongly against this "optimization" - I'm more than happy to exchange
> the one fsync for not having to manually fix the database after crash.
>
> I don't really see why switching back to recovery should be desirable in
> this case? Imagine you have a warm/hot standby, and that you promote it to
> master. The clients connect, start issuing commands and then the system
> crashes and loses the recovery.conf rename. The system reboots, database
> performs local recovery but then starts as a standby and starts rejecting
> writes. That seems really weird to me.
>> 4) StartupXLOG for the tablespace map. I don't think this one is
>> needed as well. Even if the tablespace map is not removed after a
>> power loss user would get an error telling that the file should be
>> removed.
>
> Please no, for the same reasons as in (3).
>
>> 5) For the one where haveBackupLabel || haveTblspcMap. If we do the
>> fsync, we guarantee that there is no need to do again the recovery.
>> But in case of a power loss, isn't it better to do the recovery again?
>
> Why would it be better? Why should we do something twice when we don't have
> to? Had this not be reliable, then the whole recovery process is
> fundamentally broken and we better fix it instead of merely putting a
> band-aid on it.

Group shot with 3), 4) and 5). Well, there is no data loss here,
putting me in the direction of considering this addition of an fsync
as an optimization and not a bug.

>> 6) For the one after XLogArchiveNotify() for the last partial
>> segment of the old timeline, it doesn't really matter to not make the
>> change persistent as this is mainly done because it is useful to
>> identify that it is a partial segment.
>
> OK, although I still don't quite see why that should be a reason not to do
> the fsync. It's not really going to give us any measurable performance
> advantage (how often we do those fsyncs), so I'd vote to keep it and make
> sure the partial segments are named accordingly. Less confusion is always
> better.

Check.

>> 7) In CancelBackup, this one is not needed as well I think. We would
>> surely want to get back to recovery if those files remain after a
>> power loss.
>
> I may be missing something, but why would we switch to recovery in this
> case?

>> For the ones in xlogarchive.c:
>> 1) For KeepFileRestoredFromArchive, it does not matter here, we are
>> renaming a file with a temporary name to a permanent name. Once the
>> node restarts, we would see an extra temporary file if the rename
>> was not effective.
>
> So we'll lose the segment (won't have it locally under the permanent name),
> as we've already restored it and won't do that again. Is that really a good
> thing to do?

At this point if a segment is restored from archive and there is a
power loss we are going back to recovery. The advantage of having the
fsync would ensure that the segment is not fetched twice.

>> 2) XLogArchiveForceDone, the only bad thing that would happen here is
>> to leave behind a .ready file instead of a .done file. I guess that we
>> could have it though as an optimization to not have to archive again
>> this file.
>
> Yes.
>
>>
>> For the one in pgarch.c:
>> 1) In pgarch_archiveDone, we could add it as an optimization to
>> actually let the server know that the segment has been already
>> archived, preventing a retry.
>
> Not sure what you mean by "could add it as an optimization"?

In case of a loss of the rename, server would retry archiving it. So
adding an fsync here ensures that the segment is marked as .done for
good.

>> Thoughts?
>
> Thanks for the review and comments. I think the question is whether we only
> want to do the additional fsync() only when it ultimately may lead to data
> loss, or even in cases where it may cause operational issues (e.g. switching
> back to recovery needlessly).
> I'd vote for the latter, as I think it makes the database easier to operate
> (less manual interventions) and the performance impact is 0 (as those fsyncs
> are really rare).

My first line of thoughts after looking at the patch is that I am not
against adding those fsync calls on HEAD as there is roughly an
advantage to not go back to recovery in most cases and ensure
consistent names, but as they do not imply any data loss I would not
encourage a back-patch. Adding them seems harmless at first sight I
agree, but those are not actual bugs.
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jim Nasby 2016-01-22 12:34:48 Re: count_nulls(VARIADIC "any")
Previous Message David Rowley 2016-01-22 11:13:38 Re: Parallel Aggregate