Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-09 12:16:38
Message-ID: CAMsr+YE0hvvaeAz2GbfzHYgPfZeN4KK+bCo6yMZTrNTsfCcTzg@mail.gmail.com
Lists: pgsql-hackers

On 9 April 2018 at 18:50, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:

>
> There is a clear responsibility of the application to keep
> its buffers around until a successful fsync(). The kernels
> do report the error (albeit with all the complexities of
> dealing with the interface), at which point the application
> may not assume that the write()s were ever even buffered
> in the kernel page cache in the first place.
>

> What you seem to be asking for is the capability of dropping
> buffers over the (kernel) fence and indemnifying the application
> from any further responsibility, i.e. a hard assurance
> that either the kernel will persist the pages or it will
> keep them around till the application recovers them
> asynchronously, the filesystem is unmounted, or the system
> is rebooted.
>

That's what Pg appears to assume now, yes.

Whether that's reasonable is a whole different topic.

I'd like a middle ground where the kernel lets us register our interest and
tells us if it lost something, without us having to keep eight million FDs
open for some long period. "Tell us about anything that happens under
pgdata/" or an inotify-style per-directory-registration option. I'd even
say that's ideal.

In the meantime, I propose that we fsync() each FD before close() when we
age it out of the LRU on backends. Yes, that will hurt throughput and cause
stalls, but we don't seem to have many better options. At least it'll only
flush what we actually wrote to the OS buffers, not what we may have in
shared_buffers. If the bgwriter does the same thing, we should be 100% safe
from this problem on Linux 4.13+, and it'd be trivial to make it a GUC, much
like the fsync or full_page_writes options, that people can turn off if they
know the risks / know their storage is safe / don't care.
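To make the fsync()-before-close() idea concrete, here is a minimal sketch in
plain POSIX C. It is not PostgreSQL code: close_with_fsync() and the
fsync_on_close flag are hypothetical names invented for illustration, with the
flag standing in for the GUC suggested above. The real change would live in
the backend's FD cache and md layers.

```c
#include <errno.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical switch standing in for the proposed GUC (not a real option). */
static bool fsync_on_close = true;

/*
 * Hypothetical helper: flush a file descriptor before giving it up, so any
 * writeback error is reported while we still hold the descriptor.
 * Returns 0 on success, -1 with errno set to the first error seen.
 */
int
close_with_fsync(int fd)
{
    int saved_errno = 0;

    /* Flush only what was written through this fd's file to the OS. */
    if (fsync_on_close && fsync(fd) != 0)
        saved_errno = errno;

    /* Always close, but keep the fsync() error if there was one. */
    if (close(fd) != 0 && saved_errno == 0)
        saved_errno = errno;

    if (saved_errno != 0)
    {
        errno = saved_errno;
        return -1;
    }
    return 0;
}
```

The caller decides what a -1 return means; under the proposal in this mail
that would be treated as a durability failure rather than retried.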

Someone keen could later optimise it by adding an fsync worker thread pool
in backends, so we don't block the main thread. Frankly that might be a nice
thing to have in the checkpointer anyway. But it's out of scope for fixing
this in durability terms.

I'm partway through a patch that makes us panic on fsync() errors now. Once
that's done, the next step will be to force fsync() before close() in md and
see how we go with that.

Thoughts?

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
