Re: silent data loss with ext4 / all current versions

From: Andres Freund <andres(at)anarazel(dot)de>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, dpage(at)pgadmin(dot)org, Guillaume Lelarge <guillaume(dot)lelarge(at)dalibo(dot)com>
Subject: Re: silent data loss with ext4 / all current versions
Date: 2016-03-27 23:25:09
Message-ID: 20160327232509.v5wgac5vskusedin@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2016-03-18 15:08:32 +0900, Michael Paquier wrote:
> +/*
> + * Sync data directory to ensure that what has been generated up to now is
> + * persistent in case of a crash, and this is done once globally for
> + * performance reasons as sync requests on individual files would be
> + * a negative impact on the running time of pg_rewind. This is invoked at
> + * the end of processing once everything has been processed and written.
> + */
> +static void
> +syncTargetDirectory(const char *argv0)
> +{
> + int ret;
> + char exec_path[MAXPGPATH];
> + char cmd[MAXPGPATH];
> +
> + if (dry_run)
> + return;

I think it makes more sense to return after detecting the binary, so
you'd find out about problems around not finding initdb during the dry
run, not later.

> + /* Grab and invoke initdb to perform the sync */
> + if ((ret = find_other_exec(argv0, "initdb",
> + "initdb (PostgreSQL) " PG_VERSION "\n",
> + exec_path)) < 0)
> + {
> + char full_path[MAXPGPATH];
> +
> + if (find_my_exec(argv0, full_path) < 0)
> + strlcpy(full_path, progname, sizeof(full_path));
> +
> + if (ret == -1)
> + pg_fatal("The program \"initdb\" is needed by %s but was \n"
> + "not found in the same directory as \"%s\".\n"
> + "Check your installation.\n", progname, full_path);
> + else
> + pg_fatal("The program \"postgres\" was found by \"%s\" but was \n"
> + "not the same version as %s.\n"
> + "Check your installation.\n", progname, full_path);

Wrong binary name.

> + }
> +
> + /* now run initdb */
> + if (debug)
> + snprintf(cmd, MAXPGPATH, "\"%s\" -D \"%s\" -S",
> + exec_path, datadir_target);
> + else
> + snprintf(cmd, MAXPGPATH, "\"%s\" -D \"%s\" -S > \"%s\"",
> + exec_path, datadir_target, DEVNULL);
> +
> + if (system(cmd) != 0)
> + pg_fatal("sync of target directory with initdb -S failed\n");
> + pg_log(PG_PROGRESS, "sync of target directory with initdb -S done\n");
> +}

Don't see need for emitting "done", for now at least.

> /*
> + * Wrapper of rename() similar to the backend version with the same function
> + * name aimed at making the renaming durable on disk. Note that this version
> + * does not fsync the old file before the rename as all the code paths leading
> + * to this function are already doing this operation. The new file is also
> + * normally not present on disk before the renaming so there is no need to
> + * bother about it.

I don't think it's a good idea to skip fsyncing the old file based on
that; it's way too likely that that'll not be done for the next user of
durable_rename.

> + */
> +static int
> +durable_rename(const char *oldfile, const char *newfile)
> +{
> + int fd;
> + char parentpath[MAXPGPATH];
> +
> + if (rename(oldfile, newfile) != 0)
> + {
> + /* durable_rename produced a log entry */

Uh?

> + fprintf(stderr, _("%s: could not rename file \"%s\": %s\n"),
> + progname, current_walfile_name, strerror(errno));

current_walfile_name doesn't look right, that's a largely independent
global variable.

> + if (fsync(fd) != 0)
> + {
> + close(fd);
> + fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
> + progname, newfile, strerror(errno));
> + return -1;

Close should be after the strerror() call (yes, that mistake already
existed once in receivelog.c).

> + fd = open(parentpath, O_RDONLY | PG_BINARY, 0);
> +
> + /*
> + * Some OSs don't allow us to open directories at all (Windows returns
> + * EACCES), just ignore the error in that case. If desired also silently
> + * ignoring errors about unreadable files. Log others.
> + */

Comment is not applicable as a whole.

> + if (fd < 0 && (errno == EISDIR || errno == EACCES))
> + return 0;
> + else if (fd < 0)
> + {
> + fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
> + progname, parentpath, strerror(errno));
> + return -1;
> + }
> +
> + if (fsync(fd) != 0)
> + {
> + close(fd);
> + fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
> + progname, parentpath, strerror(errno));
> + return -1;

close() needs to be moved again.

It'd be easier to apply these if the rate of trivial issues were lower
:(.

Attached is an heavily revised version of the patch. Besides the above,
I've:
* copied fsync_fname_ext from initdb, I think it's easier to understand
code this way, and it'll make unifying the code easier
* added fsyncing of the directory in mark_file_as_archived()
* The WAL files also need to be fsynced when created in open_walfile():
- otherwise the directory entry itself isn't safely persistent, as we
don't fsync the directory in the stream->synchronous fsync() cases.
- we refuse to resume in open_walfile(), if a file isn't 16MB when
restarting. Without an fsync that's actually not unlikely after a
crash. Even with an fsync that's not guaranteed not to happen, but
the chance of it is much lower.

I'm too tired to push this at the eleventh hour. Besides a heavily
revised patch, backpatching will likely include a number of conflicts.
If somebody in the US has the energy to take care of this...

I've also noticed that

a) pg_basebackup doesn't do anything about durability (it probably needs
a very similar patch to the one pg_rewind just received).
b) nor does pg_dump[all]

I think it's pretty unacceptable for backup tools to be so cavalier
about durability.

So we're going to have another round of fsync stuff in the next set of
releases anyway...

Greetings,

Andres Freund

Attachment Content-Type Size
0001-Issue-fsync-more-carefully-in-pg_receivexlog-and-pg_.patch text/x-patch 8.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2016-03-27 23:30:33 backup tools ought to ensure created backups are durable
Previous Message Tom Lane 2016-03-27 22:22:04 Re: Two division by 0 errors in optimizer/plan/planner.c and optimizer/path/costsize.c