Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)
Date: 2016-09-13 04:16:58
Message-ID: CAB7nPqQEUj+zi7fqv374W4tgxE3=Z=wmsbc+_m1NH5BFqgeXZw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Sep 10, 2016 at 6:27 PM, Peter Eisentraut
<peter(dot)eisentraut(at)2ndquadrant(dot)com> wrote:
> In fsync_pgdata(), you have removed the progname from one error
> message, even though it is passed into the function.

Right. Good catch.

> Also, I think
> fsync_pgdata() should not be printing initdb's progress messages.
> That should stay in initdb.

That's a reference to that.
+ fputs(_("syncing data to disk ... "), stdout);
+ fflush(stdout);
Oops, fixed. I missed that now that I look at it. Perhaps I was
thinking something else when hacking that lastly, like getting this
message out for pg_basebackup as well.

> Worse, in 0002 you are then changing the
> output, presumably to suit pg_basebackup or something else, which
> messes up the initdb output.

Yes, fixed.

> durable_rename() does not fsync an existing newfile before renaming,
> in contrast to the backend code in fd.c.

That's in 0002. In the case of initdb it did not really matter, but
that matters for pg_receivexlog, so let's do it unconditionally. I
reworked the code to check if the existing new file is here, and
fsync() if it exists.

> I'm slightly concerned about lots of code duplication in
> durable_rename, fsync_parent_path between backend and standalone code.

Based on the ground of code readability, I am not sure that it would
be worth sharing the code of durable_rename or even fsync_parent_path
between the backend and the frontend, particularly because they are
actually not really duplicated... One good step in favor of that would
be to get a elog()/ereport() layer in src/common as well, but that's
really something that this set of patches should tackle, no?

> Also, I'm equally slightly concerned that having duplicate function
> names will mess up tag search for someone.

There are similar cases, take palloc().. Now perhaps some of those
functions could be renamed pg_fsync_... Thoughts?

> This is a preexisting mistake, but fsync_fname_ext() should just be
> fsync_fname() to match the backend naming. In the backend,
> fsync_fname_ext() "extends" fsync_fname() by accepting an ignore_perm
> argument, but the initdb code doesn't do that.

OK.

> I don't think tar file output in pg_basebackup needs to be fsynced.
> It's just a backup file like what pg_dump produces, and we don't fsync
> that either. The point of this change is to leave a data directory in
> a synced state equivalent to what initdb leaves behind.

Here I don't agree. The point of the patch is to make sure that what
gets generated by pg_basebackup is consistent on disk, for all the
formats. It seems weird to me that we could say that the plain format
makes things consistent while the tar format may cause some files to
be lost in case of power outage. The docs make it clear I think:
+ By default, <command>pg_basebackup</command> will wait for all files
+ to be written safely to disk.
But perhaps this should directly mention that this is done for all the formats?

> Some of the changes in receivelog.c seem excessive. Replacing a plain
> fsync() with fsync_fname_ext() plus fsync_parent_path() isn't
> justified by the references you have provided. This would need more
> explanation.

In mark_file_as_archived(), we had better do it or in case of a crash
the .done file would just be gone. open_walfile() as well need that,
to ensure that the segment is created and zeroed, and in the case
where it was padded (is this one overthinking it?).

Patch 0003 had conflicts caused by 9083353.
--
Michael

Attachment Content-Type Size
0001-Relocation-fsync-routines-of-initdb-into-src-common.patch text/x-diff 19.7 KB
0002-Issue-fsync-more-carefully-in-pg_receivexlog-and-pg_.patch text/x-diff 10.7 KB
0003-Add-no-sync-option-to-pg_basebackup.patch text/x-diff 8.3 KB
0004-Switch-pg_basebackup-commands-in-Postgres.pm-to-use-.patch text/x-diff 846 bytes

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2016-09-13 04:48:57 Re: Partition-wise join for join between (declaratively) partitioned tables
Previous Message Craig Ringer 2016-09-13 03:59:41 9.6 TAP tests and extensions