Re: silent data loss with ext4 / all current versions

From: Andres Freund <andres(at)anarazel(dot)de>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: silent data loss with ext4 / all current versions
Date: 2016-03-04 22:35:30
Message-ID: 20160304223530.lc5as4wb3tuhqie5@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2016-03-04 14:51:50 +0900, Michael Paquier wrote:
> On Fri, Mar 4, 2016 at 4:06 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > Hi,
>
> Thanks for the review.
>
> >> +/*
> >> + * rename_safe -- rename of a file, making it on-disk persistent
> >> + *
> >> + * This routine ensures that a rename file persists in case of a crash by using
> >> + * fsync on the old and new files before and after performing the rename so as
> >> + * this categorizes as an all-or-nothing operation.
> >> + */
> >> +int
> >> +rename_safe(const char *oldfile, const char *newfile)
> >> +{
> >> + struct stat filestats;
> >> +
> >> + /*
> >> + * First fsync the old entry and new entry, it this one exists, to ensure
> >> + * that they are properly persistent on disk. Calling this routine with
> >> + * an existing new target file is fine, rename() will first remove it
> >> + * before performing its operation.
> >> + */
> >> + fsync_fname(oldfile, false);
> >> + if (stat(newfile, &filestats) == 0)
> >> + fsync_fname(newfile, false);
> >
> > I don't think we want any stat()s here. I'd much, much rather check open
> > for ENOENT.
>
> OK. So you mean more or less that, right?
> int fd;
> fd = OpenTransientFile(newfile, PG_BINARY | O_RDONLY, 0);
> if (fd < 0)
> {
> if (errno != ENOENT)
> return -1;
> }
> else
> {
> pg_fsync(fd);
> CloseTransientFile(fd);
> }

Yes. Otherwise the check is racy: The file could be gone by the time you
do the fsync; leading to a spurious ERROR (which often would get
promoted to a PANIC).

> >> +/*
> >> + * link_safe -- make a file hard link, making it on-disk persistent
> >> + *
> >> + * This routine ensures that a hard link created on a file persists on system
> >> + * in case of a crash by using fsync where on the link generated as well as on
> >> + * its parent directory.
> >> + */
> >> +int
> >> +link_safe(const char *oldfile, const char *newfile)
> >> +{
> >
> > If we go for a new abstraction here, I'd rather make it
> > 'replace_file_safe' or something, and move the link/rename code #ifdef
> > into it.
>
> Hm. OK. I don't see any reason why switching to link() even in code
> paths like KeepFileRestoredFromArchive() or pgarch_archiveDone() would
> be a problem thinking about it. Should HAVE_WORKING_LINK be available
> on a platform we can combine it with unlink. Is that in line with what
> you think?

I wasn't trying to suggest we should replace all rename codepaths with
the link wrapper, just the ones that already have a HAVE_WORKING_LINK
check. The name of the routine I suggested is bad though...

> >> + if (link(oldfile, newfile) < 0)
> >> + return -1;
> >> +
> >> + /*
> >> + * Make the link persistent in case of an OS crash, the new entry
> >> + * generated as well as its parent directory need to be flushed.
> >> + */
> >> + fsync_fname(newfile, false);
> >> +
> >> + /*
> >> + * Same for parent directory. This routine is never called to rename
> >> + * files across directories, but if this proves to become the case,
> >> + * flushing the parent directory if the old file would be necessary.
> >> + */
> >> + fsync_parent_path(newfile);
> >> + return 0;
> >
> > I think it's a seriously bad idea to encode that knowledge in such a
> > general sounding routine. We could however argue that this is about
> > safely replacing the *target* file; not about safely removing the old
> > file.
>
> Not sure I am following here. Are you referring to the fact that if
> the new file and old file are on different directories would make this
> routine unreliable?

Yes.

> Because yes that's the case if we want to make both of them
> persistent, and I think we want to do so.

That's one way.

> Do you suggest to correct this comment to remove the mention to the
> old file's parent directory because we just care about having the new
> file as being persistent?

That's one approach, yes. Combined with the fact that you can't actually
reliably rename across directories, the two could be on different
filesystems after all, that'd be a suitable defense. It just needs to be
properly documented in the function header, not at the bottom.

Regards,

Andres

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2016-03-04 22:37:48 Re: silent data loss with ext4 / all current versions
Previous Message Michael Paquier 2016-03-04 22:34:24 Re: Publish autovacuum informations