Skip site navigation (1) Skip section navigation (2)

Re: silent data loss with ext4 / all current versions

From: Andres Freund <andres(at)anarazel(dot)de>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>,PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>,dpage(at)pgadmin(dot)org, Guillaume Lelarge <guillaume(dot)lelarge(at)dalibo(dot)com>
Subject: Re: silent data loss with ext4 / all current versions
Date: 2016-03-27 23:25:09
Message-ID: 20160327232509.v5wgac5vskusedin@awork2.anarazel.de (view raw, whole thread or download thread mbox)
Thread:
Lists: pgsql-hackers
Hi,

On 2016-03-18 15:08:32 +0900, Michael Paquier wrote:
> +/*
> + * Sync data directory to ensure that what has been generated up to now is
> + * persistent in case of a crash, and this is done once globally for
> + * performance reasons as sync requests on individual files would be
> + * a negative impact on the running time of pg_rewind. This is invoked at
> + * the end of processing once everything has been processed and written.
> + */
> +static void
> +syncTargetDirectory(const char *argv0)
> +{
> +	int		ret;
> +	char	exec_path[MAXPGPATH];
> +	char	cmd[MAXPGPATH];
> +
> +	if (dry_run)
> +		return;

I think it makes more sense to return after detecting the binary, so
you'd find out about problems around not finding initdb during the dry
run, not later.


> +	/* Grab and invoke initdb to perform the sync */
> +	if ((ret = find_other_exec(argv0, "initdb",
> +							   "initdb (PostgreSQL) " PG_VERSION "\n",
> +							   exec_path)) < 0)
> +	{
> +		char        full_path[MAXPGPATH];
> +
> +		if (find_my_exec(argv0, full_path) < 0)
> +			strlcpy(full_path, progname, sizeof(full_path));
> +
> +		if (ret == -1)
> +			pg_fatal("The program \"initdb\" is needed by %s but was \n"
> +					 "not found in the same directory as \"%s\".\n"
> +					 "Check your installation.\n", progname, full_path);
> +		else
> +			pg_fatal("The program \"postgres\" was found by \"%s\" but was \n"
> +					 "not the same version as %s.\n"
> +					 "Check your installation.\n", progname, full_path);

Wrong binary name.


> +	}
> +
> +	/* now run initdb */
> +	if (debug)
> +		snprintf(cmd, MAXPGPATH, "\"%s\" -D \"%s\" -S",
> +				 exec_path, datadir_target);
> +	else
> +		snprintf(cmd, MAXPGPATH, "\"%s\" -D \"%s\" -S > \"%s\"",
> +				 exec_path, datadir_target, DEVNULL);
> +
> +	if (system(cmd) != 0)
> +		pg_fatal("sync of target directory with initdb -S failed\n");
> +	pg_log(PG_PROGRESS, "sync of target directory with initdb -S done\n");
> +}

Don't see need for emitting "done", for now at least.


>  /*
> + * Wrapper of rename() similar to the backend version with the same function
> + * name aimed at making the renaming durable on disk. Note that this version
> + * does not fsync the old file before the rename as all the code paths leading
> + * to this function are already doing this operation. The new file is also
> + * normally not present on disk before the renaming so there is no need to
> + * bother about it.

I don't think it's a good idea to skip fsyncing the old file based on
that; it's way too likely that that'll not be done for the next user of
durable_rename.


> + */
> +static int
> +durable_rename(const char *oldfile, const char *newfile)
> +{
> +	int		fd;
> +	char	parentpath[MAXPGPATH];
> +
> +	if (rename(oldfile, newfile) != 0)
> +	{
> +		/* durable_rename produced a log entry */

Uh?


> +		fprintf(stderr, _("%s: could not rename file \"%s\": %s\n"),
> +				progname, current_walfile_name, strerror(errno));

current_walfile_name doesn't look right, that's a largely independent
global variable.


> +	if (fsync(fd) != 0)
> +	{
> +		close(fd);
> +		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
> +				progname, newfile, strerror(errno));
> +		return -1;

Close should be after the strerror() call (yes, that mistake already
existed once in receivelog.c).


> +	fd = open(parentpath, O_RDONLY | PG_BINARY, 0);
> +
> +	/*
> +	 * Some OSs don't allow us to open directories at all (Windows returns
> +	 * EACCES), just ignore the error in that case.  If desired also silently
> +	 * ignoring errors about unreadable files. Log others.
> +	 */

Comment is not applicable as a whole.


> +	if (fd < 0 && (errno == EISDIR || errno == EACCES))
> +		return 0;
> +	else if (fd < 0)
> +	{
> +		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
> +				progname, parentpath, strerror(errno));
> +		return -1;
> +	}
> +
> +	if (fsync(fd) != 0)
> +	{
> +		close(fd);
> +		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
> +				progname, parentpath, strerror(errno));
> +		return -1;

close() needs to be moved again.


It'd be easier to apply these if the rate of trivial issues were lower
:(.


Attached is an heavily revised version of the patch. Besides the above,
I've:
* copied fsync_fname_ext from initdb, I think it's easier to understand
  code this way, and it'll make unifying the code easier
* added fsyncing of the directory in mark_file_as_archived()
* The WAL files also need to be fsynced when created in open_walfile():
  - otherwise the directory entry itself isn't safely persistent, as we
    don't fsync the directory in the stream->synchronous fsync() cases.
  - we refuse to resume in open_walfile(), if a file isn't 16MB when
    restarting. Without an fsync that's actually not unlikely after a
    crash. Even with an fsync that's not guaranteed not to happen, but
    the chance of it is much lower.

I'm too tired to push this at the eleventh hour. Besides a heavily
revised patch, backpatching will likely include a number of conflicts.
If somebody in the US has the energy to take care of this...


I've also noticed that

a) pg_basebackup doesn't do anything about durability (it probably needs
   a very similar patch to the one pg_rewind just received).
b) nor does pg_dump[all]

I think it's pretty unacceptable for backup tools to be so cavalier
about durability.


So we're going to have another round of fsync stuff in the next set of
releases anyway...


Greetings,

Andres Freund

Attachment: 0001-Issue-fsync-more-carefully-in-pg_receivexlog-and-pg_.patch
Description: text/x-patch (8.0 KB)

In response to

Responses

pgsql-hackers by date

Next:From: Andres FreundDate: 2016-03-27 23:30:33
Subject: backup tools ought to ensure created backups are durable
Previous:From: Tom LaneDate: 2016-03-27 22:22:04
Subject: Re: Two division by 0 errors in optimizer/plan/planner.c and optimizer/path/costsize.c

Privacy Policy | About PostgreSQL
Copyright © 1996-2018 The PostgreSQL Global Development Group