Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Anthony Iliopoulos <ailiop(at)altatus(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, ailiop(at)altatus(dot)com
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-03 10:35:39
Message-ID: 20180403103538.GP11627@technoir
Lists: pgsql-hackers

Hi Robert,

On Mon, Apr 02, 2018 at 10:54:26PM -0400, Robert Haas wrote:
> On Mon, Apr 2, 2018 at 2:53 PM, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:
> > Given precisely that the dirty pages which cannot been written-out are
> > practically thrown away, the semantics of fsync() (after the 4.13 fixes)
> > are essentially correct: the first call indicates that a writeback error
> > indeed occurred, while subsequent calls have no reason to indicate an error
> > (assuming no other errors occurred in the meantime).
>
> Like other people here, I think this is 100% unreasonable, starting
> with "the dirty pages which cannot been written out are practically
> thrown away". Who decided that was OK, and on the basis of what
> wording in what specification? I think it's always unreasonable to

If you insist on strict conformance to POSIX, then indeed the Linux
glibc configuration and the associated manpage are probably wrong in
stating that _POSIX_SYNCHRONIZED_IO is supported. The implementation
instead takes the flexibility that is only allowed when SIO is not
supported. There's a long history of divergence between Linux and
POSIX, and I don't think there was ever an intention of conforming
strictly to the standard.
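
To illustrate, a minimal userspace check of what the platform claims
(this only reflects the glibc/unistd.h configuration, not what the
kernel's writeback path actually does):

#include <stdio.h>
#include <unistd.h>

int
main(void)
{
#ifdef _POSIX_SYNCHRONIZED_IO
    printf("_POSIX_SYNCHRONIZED_IO (compile time): %ld\n",
           (long) _POSIX_SYNCHRONIZED_IO);
#else
    printf("_POSIX_SYNCHRONIZED_IO not defined at compile time\n");
#endif
    /* > 0 means the platform advertises synchronized I/O support */
    printf("sysconf(_SC_SYNCHRONIZED_IO): %ld\n",
           sysconf(_SC_SYNCHRONIZED_IO));
    return 0;
}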

> throw away the user's data. If the writes are going to fail, then let
> them keep on failing every time. *That* wouldn't cause any data loss,
> because we'd never be able to checkpoint, and eventually the user
> would have to kill the server uncleanly, and that would trigger
> recovery.

I believe (as I tried to explain earlier) there is a certain assumption
being made that the writer and original owner of the data is responsible
for dealing with potential errors in order to avoid data loss (which
should be of interest only to the original writer anyway). It would
be very questionable for the interface to persist the error while
subsequent writes and fsyncs to different offsets may well go through.
Another process that needs to write into the file and fsync(), while
being unaware of the newly introduced semantics, would now be faced
with EIO because some unrelated earlier process failed some writes
and did not bother to clear the error for them. In a similar scenario
where the second process is aware of the new semantics, it would
naturally go ahead and clear the global error in order to proceed
with its own write()+fsync(), which essentially amounts to the same
problematic semantics you have now.

> Also, this really does make it impossible to write reliable programs.
> Imagine that, while the server is running, somebody runs a program
> which opens a file in the data directory, calls fsync() on it, and
> closes it. If the fsync() fails, postgres is now borked and has no
> way of being aware of the problem. If we knew, we could PANIC, but
> we'll never find out, because the unrelated process ate the error.
> This is exactly the sort of ill-considered behavior that makes fcntl()
> locking nearly useless.

Fully agree, and the errseq_t fixes have dealt with exactly this issue
of making sure that the error is reported to all file descriptors that
*happen to be open at the time of the error*. But I think one would
have a hard time defending a modification to the kernel where this is
further extended to cover cases where:

process A does a write() at some file offset which fails writeback;
its fsync() gets EIO and it exit()s.

process B does a write() at some other offset which succeeds writeback,
but its fsync() gets EIO due to the (uncleared) failures of the earlier
process.

This would be a highly user-visible change of semantics, from edge-
triggered to level-triggered behavior.
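
To make the current (post-4.13) semantics concrete, here is a rough
sketch, not a test case, of what a single application observes today
under check-and-clear reporting, assuming a writeback failure is
injected underneath it (e.g. a thin pool running out of space):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    int     fd = open("datafile", O_WRONLY | O_CREAT, 0644);
    char    buf[4096] = {0};

    if (fd < 0)
        return 1;
    if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t) sizeof(buf))
        return 1;

    /*
     * Suppose writeback of the page above fails underneath us: the
     * first fsync() reports the error...
     */
    if (fsync(fd) != 0)
        fprintf(stderr, "first fsync: %s (reported once)\n",
                strerror(errno));

    /*
     * ...but by then the error state has been consumed and the dirty
     * page dropped, so a second fsync() on the same descriptor
     * typically returns success even though the data never reached
     * stable storage.
     */
    if (fsync(fd) == 0)
        fprintf(stderr, "second fsync: success\n");

    close(fd);
    return 0;
}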

> dodge this issue in another way: suppose that when we write a page
> out, we don't consider it really written until fsync() succeeds. Then

That's the only way to think about fsync() guarantees unless you
are on a kernel that keeps retrying to persist the dirty pages.
Assuming such a model, after repeated and unrecoverable hard failures
the process would have to explicitly inform the kernel to drop the
dirty pages. All the process could do at that point is read the
dirty/failed pages back into userspace and attempt to rewrite them
at a different place (which is currently possible too). Most
applications, though, would not bother to inform the kernel and drop
the permanently failed pages; and so eventually someone would hit the
case where a large number of failed writeback pages is running their
server out of memory, at which point people would complain that those
semantics are completely unreasonable.
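
A minimal sketch of what that model looks like from the writer's side,
assuming the application retains its own copy of the data until fsync()
has succeeded (retry limits and backoff are of course application
policy):

#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write and fsync, re-issuing the write from the caller-held buffer
 * if fsync() reports EIO, instead of expecting the kernel to keep
 * retrying dirty pages it has already dropped.
 */
int
durable_pwrite(int fd, const void *buf, size_t len, off_t off,
               int max_retries)
{
    for (int attempt = 0; attempt <= max_retries; attempt++)
    {
        if (pwrite(fd, buf, len, off) != (ssize_t) len)
            return -1;          /* the write itself failed */
        if (fsync(fd) == 0)
            return 0;           /* data has reached stable storage */
        if (errno != EIO)
            return -1;          /* non-retryable error */

        /*
         * EIO: the kernel has dropped the failed dirty pages, but we
         * still hold the data in buf, so we can write it again
         * (possibly to a different file or offset) and retry.
         */
    }
    return -1;                  /* gave up after max_retries attempts */
}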

> we wouldn't need to PANIC if an fsync() fails; we could just re-write
> the page. Unfortunately, this would also be terrible for performance,
> for pretty much the same reasons: letting the OS cache absorb lots of
> dirty blocks and do write-combining is *necessary* for good
> performance.

Not sure I understand this case. The application may indeed re-write
a bunch of pages that have failed and then proceed with fsync(). The
kernel will deal with combining the writeback of all the re-written
pages. Beyond that, the necessity of write-combining for performance
really depends on the exact storage medium. At the point where you
start caring about write-combining, the kernel community will
naturally redirect you to use direct I/O (O_DIRECT).
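
For illustration, a rough sketch of that direct I/O path: the page
cache is bypassed, so data errors surface on the write() itself and
the application does its own buffering and write-combining. The
4096-byte alignment used here is an assumption; the actual requirement
is device-dependent.

#define _GNU_SOURCE             /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    int     fd = open("datafile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    void   *buf;

    if (fd < 0)
        return 1;

    /* buffer, length and file offset must all be suitably aligned */
    if (posix_memalign(&buf, 4096, 4096) != 0)
        return 1;
    memset(buf, 0, 4096);

    if (pwrite(fd, buf, 4096, 0) != 4096)
        return 1;               /* data error is reported here, on the
                                 * write; metadata still needs fsync() */

    free(buf);
    close(fd);
    return 0;
}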

> > The error reporting is thus consistent with the intended semantics (which
> > are sadly not properly documented). Repeated calls to fsync() simply do not
> > imply that the kernel will retry to writeback the previously-failed pages,
> > so the application needs to be aware of that. Persisting the error at the
> > fsync() level would essentially mean moving application policy into the
> > kernel.
>
> I might accept this argument if I accepted that it was OK to decide
> that an fsync() failure means you can forget that the write() ever
> happened in the first place, but it's hard to imagine an application
> that wants that behavior. If the application didn't care about
> whether the bytes really got to disk or not, it would not have called
> fsync() in the first place. If it does care, reporting the error only
> once is never an improvement.

Again, this conflates two separate issues: that of buffering and
retrying failed pages, and that of error reporting. Yes, it would be
convenient for applications not to have to care at all about recovery
of failed writebacks, but at some point they would have to face this
issue one way or another (I am assuming we are always talking about
hard failures; other kinds of failures are probably already dealt
with transparently at the kernel level).

As for the reporting, it is also unreasonable to effectively signal
and persist an error at file-wide granularity while it pertains only
to subsets of that file and other writes can still go through, but I
am repeating myself.

I suppose that if the check-and-clear semantics are problematic for
Pg, one could propose a kernel patch that opts in to level-triggered
reporting of fsync() errors on a per-descriptor basis, which seems
non-intrusive and probably sufficient to cover your expected use case.

Best regards,
Anthony
