Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Anthony Iliopoulos <ailiop(at)altatus(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-04 15:23:51
Message-ID: CAMsr+YFFeQo_nsfQff_tWxxvwg7gV88mVfFku4GQJYSRktnU3A@mail.gmail.com
Lists: pgsql-hackers

On 4 April 2018 at 21:49, Bruce Momjian <bruce(at)momjian(dot)us> wrote:

> On Wed, Apr 4, 2018 at 07:32:04PM +1200, Thomas Munro wrote:
> > On Wed, Apr 4, 2018 at 6:00 PM, Craig Ringer <craig(at)2ndquadrant(dot)com>
> wrote:
> > > On 4 April 2018 at 13:29, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
> > > wrote:
> > >> /* Ensure that we skip any errors that predate opening of the file */
> > >> f->f_wb_err = filemap_sample_wb_err(f->f_mapping);
> > >>
> > >> [...]
> > >
> > > Holy hell. So even PANICing on fsync() isn't sufficient, because the
> kernel
> > > will deliberately hide writeback errors that predate our fsync() call
> from
> > > us?
> >
> > Predates the opening of the file by the process that calls fsync().
> > Yeah, it sure looks that way based on the above code fragment. Does
> > anyone know better?
>
> Uh, just to clarify, what is new here is that it is ignoring any
> _errors_ that happened before the open(). It is not ignoring write()s
> that happened before the open() but had not yet reached storage.
>
> FYI, pg_test_fsync has always tested the ability to fsync() write()s
> from other processes:
>
> Test if fsync on non-write file descriptor is honored:
> (If the times are similar, fsync() can sync data written on a different
> descriptor.)
>         write, fsync, close        5360.341 ops/sec     187 usecs/op
>         write, close, fsync        4785.240 ops/sec     209 usecs/op
>
> Those two numbers should be similar. I added this as a check to make
> sure the behavior we were relying on was working. I never tested sync
> errors though.
>
> I think the fundamental issue is that we always assumed that writes to
> the kernel that could not be written to storage would remain in the
> kernel until they succeeded, and that fsync() would report their
> existence.
>
> I can understand why kernel developers don't want to keep failed sync
> buffers in memory, and once they are gone we lose reporting of their
> failure. Also, if the kernel is not going to retry the syncs, how long
> should it keep reporting the sync failure?

Ideally until the app tells it not to.

But there's no standard API for that.

The obvious answer seems to be "until the FD is closed". But we just
discussed how Pg relies on being able to open and close files freely. That
may not be as reasonable a thing to do as we thought it was when you
consider error reporting. What's the kernel meant to do? How long should it
remember "I had an error while doing writeback on this file"? Should it
flag the file metadata and remember across reboots? Obviously not, but
where does it stop? Tell the next program that does an fsync() and forget?
How could it associate a dirty buffer on a file with no open FDs with any
particular program at all? And what if the app did a write then closed the
file and went away, never to bother to check the file again, like most apps
do?
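The open-and-close pattern at issue can be made concrete with a short C sketch. This is a hedged illustration, not PostgreSQL code: it mimics a backend dirtying a file and closing it, then a checkpointer-style process reopening and fsync()ing. On kernels using the errseq_t sampling quoted earlier (filemap_sample_wb_err at open), the final fsync() would only report writeback errors occurring after the reopen, so a failure in the gap is silently lost to us. The sketch only exercises the happy path; actually observing the lost error would require fault injection (e.g. a dm-error device), which it doesn't attempt.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Returns 0 if the final fsync() reports success, -1 on setup failure,
 * or the fsync() result otherwise. */
int demo_reopen_fsync(void)
{
    char path[] = "/tmp/fsync_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;

    /* Backend-style write: dirty the page cache, then close without fsync. */
    if (write(fd, "hello\n", 6) != 6)
    {
        close(fd);
        unlink(path);
        return -1;
    }
    close(fd);

    /* If writeback of that dirty page failed right here, the error is
     * attributed to the file's mapping, not to any FD we still hold. */

    /* Checkpointer-style reopen + fsync: the new FD samples the file's
     * current writeback error sequence at open(), so it can only see
     * errors that happen after this point. */
    fd = open(path, O_RDWR);
    if (fd < 0)
    {
        unlink(path);
        return -1;
    }
    int rc = fsync(fd);
    close(fd);
    unlink(path);
    return rc;
}
```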

Some I/O errors are transient (network issue, etc). Some are recoverable
with some sort of action, like disk space issues, but may take a long time
before an admin steps in. Some are entirely unrecoverable (disk 1 in
striped array is on fire) and there's no possible recovery. Currently we
kind of hope the kernel will deal with figuring out which is which and
retrying. Turns out it doesn't do that so much, and I don't think the
reasons for that are wholly unreasonable. We may have been asking too much.

That does leave us in a pickle when it comes to the checkpointer and
opening/closing FDs. I don't know what the "right" thing for the kernel to
do from our perspective even is here, but the best I can come up with is
actually pretty close to what it does now. Report the fsync() error to the
first process that does an fsync() since the writeback error if one has
occurred, then forget about it. Ideally I'd have liked it to mark all FDs
pointing to the file with a flag to report EIO on next fsync too, but it
turns out that won't even help us due to our opening and closing behaviour,
so we're going to have to take responsibility for handling and
communicating that ourselves, preventing checkpoint completion if any
backend gets an fsync error. Probably by PANICing. Some extra work may be
needed to ensure reliable ordering, and to stop a checkpoint from completing
when its fsync() succeeds only because a recent failed fsync() in a normal
backend already consumed the error, but that backend hasn't PANICed or the
postmaster hasn't noticed yet.
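In code terms, "take responsibility ourselves" amounts to treating the first reported fsync() failure as fatal rather than retrying, because a retried fsync() can return success after the kernel has already dropped the failed dirty pages. A minimal, hypothetical sketch (fsync_or_panic is an illustrative name, not an actual PostgreSQL function; a real backend would use ereport(PANIC, ...) so that crash recovery replays WAL from before the checkpoint):

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: any fsync() failure is treated as unrecoverable.
 * Aborting forces WAL replay on restart, rather than trusting a retry
 * whose "success" may only mean the kernel forgot the failed pages. */
void fsync_or_panic(int fd, const char *path)
{
    if (fsync(fd) != 0)
    {
        fprintf(stderr, "PANIC: could not fsync \"%s\": %s\n",
                path, strerror(errno));
        abort();
    }
}
```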

> Our only option might be to tell administrators to closely watch for
> kernel write failure messages, and then restore or failover. :-(
>

Speaking of, there's not necessarily any lost page write error in the logs
AFAICS. My tests often just show "Buffer I/O error on device dm-0, logical
block 59393" or the like.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
