Re: silent data loss with ext4 / all current versions

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: silent data loss with ext4 / all current versions
Date: 2016-02-24 07:06:10
Message-ID: CAB7nPqTqAk4QfuyJDXNaAhLXKkOofweXc9rocEH1YDZd0kE9ww@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Feb 24, 2016 at 7:26 AM, Tomas Vondra wrote:
> 1) I'm not quite sure why the patch adds missing_ok to fsync_fname()? The
> only place where we use missing_ok=true is in rename_safe, where right at
> the beginning we do this:
>
> fsync_fname(newfile, false, true);
>
> I.e. we're fsyncing the rename target first, in case it exists. But that
> seems to be conflicting with the comments in xlog.c where we explicitly
> state that the target file should not exist. Why should it be OK to call
> rename_safe() when the target already exists? If this really is the right
> thing to do, it should be explained in the comment above rename_safe(),
> probably.

The point is to mimic rename(), which can manage the case where the
new entry exists, and to look for a consistent on-disk behavior. It is
true that the new argument of fsync_fname is actually not necessary,
we could just check for the existence of the new entry with stat(),
and perform an fsync if it exists.
I have added the following comment in rename_safe():
+ /*
+ * First fsync the old entry and new entry, it this one exists, to ensure
+ * that they are properly persistent on disk. Calling this routine with
+ * an existing new target file is fine, rename() will first remove it
+ * before performing its operation.
+ */
How does that look?

> 2) If rename_safe/link_safe are meant as crash-safe replacements for
> rename/link, then perhaps we should use the same signatures, including the
> "const" pointer parameters. So while currently the signatures look like
> this:
>
> int rename_safe(char *oldfile, char *newfile);
> int link_safe(char *oldfile, char *newfile);
>
> it should probably look like this
>
> int rename_safe(const char *oldfile, const char *newfile);
> int link_safe(const char *oldfile, const char *newfile);
>
> I've noticed this in CheckPointReplicationOrigin() where the current code
> has to cast the parameters to (char*) to silence the compiler.

I recall considering that, and the reason why I did not do so was that
fsync_fname() is not doing it either, because OpenTransientFile() is
using FileName which is not defined as a constant. At the end we'd
finish with one or two casts anyway. So it seems that changing
fsync_fname makes sense though by looking at fsync_fname_ext,
OpenTransientFile() is using an explicit cast for the file name.

> 3) Both rename_safe and link_safe do this at the very end:
>
> fsync_parent_path(newfile);
>
> That however assumes both the oldfile and newfile are placed in the same
> directory - otherwise we'd fsync only one of them. I don't think we have a
> place where we're renaming files between directories (or do we), so we're OK
> with respect to this. But it seems like a good idea to defend against this,
> or at least mention that in the comments.

No, we don't have a code path where a file is renamed between
different directories, which is why this code is doing so. The comment
is a good addition to have though, so I have added it. I guess that we
could complicate more this patch to check if the parent directories of
the new and old entries match or not, then fsync both of them, but I'd
rather keep things simple.

> 4) nitpicking: There are some unnecessary newlines added/removed in
> RemoveXlogFile, XLogArchiveForceDone.

Fixed. I missed those three ones.
--
Michael

Attachment Content-Type Size
xlog-fsync-v6.patch text/x-patch 13.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2016-02-24 07:29:58 Re: postgres_fdw vs. force_parallel_mode on ppc
Previous Message Robert Haas 2016-02-24 06:23:30 Re: Allow to specify (auto-)vacuum cost limits relative to the database/cluster size?