Re: silent data loss with ext4 / all current versions

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: silent data loss with ext4 / all current versions
Date: 2016-02-02 07:20:12
Message-ID: CAB7nPqTOnCJQH0Kkk6tawn2MT0JtsCzNG+PkUz8Tj6vGQkwfcg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Feb 2, 2016 at 1:07 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> On 2016-01-25 16:30:47 +0900, Michael Paquier wrote:
>> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
>> index a2846c4..b124f90 100644
>> --- a/src/backend/access/transam/xlog.c
>> +++ b/src/backend/access/transam/xlog.c
>> @@ -3278,6 +3278,14 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
>> tmppath, path)));
>> return false;
>> }
>> +
>> + /*
>> + * Make sure the rename is permanent by fsyncing the parent
>> + * directory.
>> + */
>> + START_CRIT_SECTION();
>> + fsync_fname(XLOGDIR, true);
>> + END_CRIT_SECTION();
>> #endif
>
> Hm. I'm seriously doubtful that using critical sections for this is a
> good idea. What's the scenario you're protecting against by declaring
> this one? We intentionally don't error out in the isdir cases in
> fsync_fname() cases anyway.
>
> Afaik we need to fsync tmppath before and after the rename, and only
> then the directory, to actually be safe.

Regarding the fsync call on the new file before the rename, would it
be better to extend fsync_fname() with some kind of noerror flag to
work around the case of a file that does not exist or do you think it
is better just to use pg_fsync in this case after getting an fd? Using
directly pg_fsync() looks redundant with what fsync_fname does
already.

>> @@ -5297,6 +5313,9 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
>> errmsg("could not rename file \"%s\" to \"%s\": %m",
>> RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE)));
>>
>> + /* Make sure the rename is permanent by fsyncing the data directory. */
>> + fsync_fname(".", true);
>> +
>
> Shouldn't RECOVERY_COMMAND_DONE be fsynced first here?

Check.

>> /*
>> @@ -6525,6 +6550,13 @@ StartupXLOG(void)
>> TABLESPACE_MAP, TABLESPACE_MAP_OLD)));
>> }
>>
>> + /*
>> + * Make sure the rename is permanent by fsyncing the parent
>> + * directory.
>> + */
>> + if (haveBackupLabel || haveTblspcMap)
>> + fsync_fname(".", true);
>> +
>
> Isn't that redundant with the haveTblspcMap case above?

I am not sure I get your point here. Are you referring to the fact
that fsync should be done after each rename in this case?

>> /* Check that the GUCs used to generate the WAL allow recovery */
>> CheckRequiredParameterValues();
>>
>> @@ -7305,6 +7337,12 @@ StartupXLOG(void)
>> errmsg("could not rename file \"%s\" to \"%s\": %m",
>> origpath, partialpath)));
>> XLogArchiveNotify(partialfname);
>> +
>> + /*
>> + * Make sure the rename is permanent by fsyncing the parent
>> + * directory.
>> + */
>> + fsync_fname(XLOGDIR, true);
>
> .partial should be fsynced first.

Check.

>> }
>> }
>> }
>> @@ -10905,6 +10943,9 @@ CancelBackup(void)
>> BACKUP_LABEL_FILE, BACKUP_LABEL_OLD,
>> TABLESPACE_MAP, TABLESPACE_MAP_OLD)));
>> }
>> +
>> + /* fsync the data directory to persist the renames */
>> + fsync_fname(".", true);
>> }
>
> Same.

Re-check.

>> /*
>> diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
>> index 277c14a..8dda80b 100644
>> --- a/src/backend/access/transam/xlogarchive.c
>> +++ b/src/backend/access/transam/xlogarchive.c
>> @@ -477,6 +477,12 @@ KeepFileRestoredFromArchive(char *path, char *xlogfname)
>> path, xlogfpath)));
>>
>> /*
>> + * Make sure the renames are permanent by fsyncing the parent
>> + * directory.
>> + */
>> + fsync_fname(XLOGDIR, true);
>
> Afaics the file under the temporary filename has not been fsynced up to
> here.

Yes, true, the old file...

>> + /*
>> * Create .done file forcibly to prevent the restored segment from being
>> * archived again later.
>> */
>> @@ -586,6 +592,11 @@ XLogArchiveForceDone(const char *xlog)
>> errmsg("could not rename file \"%s\" to \"%s\": %m",
>> archiveReady, archiveDone)));
>>
>> + /*
>> + * Make sure the rename is permanent by fsyncing the parent
>> + * directory.
>> + */
>> + fsync_fname(XLOGDIR "/archive_status", true);
>> return;
>> }
>
> Afaics the AllocateFile() call below is not protected at all, no?

Yep.

> How about introducing a 'safe_rename()' that does something roughly akin
> to:
> fsync(oldname);
> fsync(fname) || true;
> rename(oldfname, fname);
> fsync(fname);
> fsync(basename(fname));
>
> I'd rather have that kind of logic somewhere once, instead of repeated a
> dozen times...

Not wrong, and this leads to the following:
void rename_safe(const char *old, const char *new, bool isdir, int elevel);
Controlling elevel is necessary per the multiple code paths that would
use it. Some use ERROR, most of them FATAL, and a bit of WARNING. Does
that look fine?
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2016-02-02 08:11:25 Re: Freeze avoidance of very large table.
Previous Message Andreas Joseph Krogh 2016-02-02 07:18:37 Re: [PATCH] Phrase search ported to 9.6