Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Anthony Iliopoulos <ailiop(at)altatus(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Craig Ringer <craig(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-08 21:47:04
Message-ID: 20180408214704.GA18969@technoir
Lists: pgsql-hackers

On Sun, Apr 08, 2018 at 10:23:21PM +0100, Greg Stark wrote:
> On 8 April 2018 at 04:27, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
> > On 8 April 2018 at 10:16, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
> > wrote:
> >
> > If the kernel does writeback in the middle, how on earth is it supposed to
> > know we expect to reopen the file and check back later?
> >
> > Should it just remember "this file had an error" forever, and tell every
> > caller? In that case how could we recover? We'd need some new API to say
> > "yeah, ok already, I'm redoing all my work since the last good fsync() so
> > you can clear the error flag now". Otherwise it'd keep reporting an error
> > after we did redo to recover, too.
>
> There is no spoon^H^H^H^H^Herror flag. We don't need fsync to keep
> track of any errors. We just need fsync to accurately report whether
> all the buffers in the file have been written out. When you call fsync

Instead, fsync() reports when some of the buffers have not been
written out, for the reasons outlined earlier in this thread. Given
that, it may make sense to keep tracking errors even after the failed
dirty pages have been marked clean (this has in fact been proposed,
but it introduces memory overhead).

> again the kernel needs to initiate i/o on all the dirty buffers and
> block until they complete successfully. If they complete successfully
> then nobody cares whether they had some failure in the past when i/o
> was initiated at some point in the past.

The question is what the kernel and the application should do in cases
where this is simply not possible. According to FreeBSD, for example,
which does keep dirty pages around after a failure, -EIO from the block
layer is a contract for unrecoverable errors, so it is pointless to
keep the pages dirty. You would need a specialized interface to clear
out the errors (and drop the dirty pages), or potentially just remount
the filesystem.

> The problem is not that errors aren't being tracked correctly. The
> problem is that dirty buffers are being marked clean when they haven't
> been written out. They consider dirty filesystem buffers when there's
> hardware failure preventing them from being written "a memory leak".
>
> As long as any error means the kernel has discarded writes then
> there's no real hope of any reliable operation through that interface.

This does not necessarily follow. Whether or not the kernel discards
the writes would not really help (see above). It is more a matter of
a proper "reporting contract" between userspace and the kernel.
Tracking errors in the kernel would be one way to provide that
contract; the alternative is a more complex userspace scheme (as
described by others in this thread) in which a multi-process
application has to synchronize its fsync() calls.
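
As a rough illustration of the userspace side of such a contract,
limited to a single process: treat any fsync() failure as meaning that
everything written since the last successful fsync() may be lost, and
never retry fsync() in the hope that a later success proves durability.
This is a sketch, not PostgreSQL code; write_durably, DurabilityError
and the fsync parameter are hypothetical names introduced here.

```python
import os


class DurabilityError(Exception):
    """fsync() failed; data since the last good fsync() may be lost."""


def write_durably(path, data, fsync=os.fsync):
    """Write data to path and fsync it once; never retry a failed fsync().

    The fsync parameter is a hypothetical hook so a failure can be injected.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        while len(view) > 0:
            written = os.write(fd, view)  # os.write may write fewer bytes
            view = view[written:]
        try:
            fsync(fd)
        except OSError as exc:
            # A retry could return success without the data being on disk,
            # so surface the failure and let the caller redo its work.
            raise DurabilityError(f"fsync failed for {path}") from exc
    finally:
        os.close(fd)
```

The caller is expected to recover by redoing its work from its own log
rather than by calling fsync() again. This does not address the
multi-process case, where the process that sees the error may not be
the one that wrote the data.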

Best regards,
Anthony
