Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-03-29 05:58:45
Message-ID: CAMsr+YF1wcY6TkRXvtD=se5=qX1zJ8N8dieBV7qeCOoG7fXRCg@mail.gmail.com
Lists: pgsql-hackers

On 28 March 2018 at 11:53, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Craig Ringer <craig(at)2ndquadrant(dot)com> writes:
> > TL;DR: Pg should PANIC on fsync() EIO return.
>
> Surely you jest.
>

No. I'm quite serious. Worse, we quite possibly have to do it for ENOSPC as
well to avoid similar lost-page-write issues.

It's not necessary on ext3/ext4 with errors=remount-ro, but that's only
because the FS stops us dead in our tracks.

I don't pretend it's sane. The kernel behaviour is IMO crazy. If it's going
to lose a write, it should at minimum mark the FD as broken so no further
fsync() or anything else can succeed on the FD, and an app that cares about
durability must repeat the whole set of work since the prior successful
fsync(). Just reporting it once and forgetting it is madness.
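
To make "repeat the whole set of work since the prior successful fsync()"
concrete, here is a rough sketch in C. It is not PostgreSQL code: the
durable_file type and the df_* functions are hypothetical names invented for
this illustration, and the fixed-size pending buffer is an assumption to keep
it short. The idea is only that the application keeps its own copy of
everything written since the last fsync() that returned success, treats the
descriptor as broken after any failure, and replays the saved bytes on a
fresh descriptor.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define PENDING_MAX 4096

/* Hypothetical wrapper: remembers every byte written since the last
 * fsync() that returned success, so that work can be repeated. */
typedef struct {
    int    fd;
    bool   broken;               /* set once fsync() has failed on fd */
    off_t  synced_end;           /* file length we believe is durable */
    size_t pending_len;          /* bytes written since then */
    char   pending[PENDING_MAX];
} durable_file;

void df_init(durable_file *df, int fd)
{
    assert(df != NULL);
    df->fd = fd;
    df->broken = false;
    df->synced_end = 0;
    df->pending_len = 0;
}

/* Append data; refuse if the descriptor is broken or the buffer is full. */
int df_append(durable_file *df, const void *buf, size_t len)
{
    if (df->broken || len > PENDING_MAX - df->pending_len)
        return -1;
    off_t off = df->synced_end + (off_t) df->pending_len;
    if (pwrite(df->fd, buf, len, off) != (ssize_t) len)
        return -1;
    memcpy(df->pending + df->pending_len, buf, len);
    df->pending_len += len;
    return 0;
}

/* Only a successful fsync() lets us forget the pending bytes. */
int df_sync(durable_file *df)
{
    if (df->broken)
        return -1;
    if (fsync(df->fd) != 0) {
        df->broken = true;       /* never trust this fd again */
        return -1;
    }
    df->synced_end += (off_t) df->pending_len;
    df->pending_len = 0;
    return 0;
}

/* Reopen the file and repeat all work since the last good fsync(). */
int df_recover(durable_file *df, const char *path)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    if (pwrite(fd, df->pending, df->pending_len, df->synced_end)
            != (ssize_t) df->pending_len
        || fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    close(df->fd);
    df->fd = fd;
    df->broken = false;
    df->synced_end += (off_t) df->pending_len;
    df->pending_len = 0;
    return 0;
}
```

Even this only helps if the replayed write and the second fsync() actually
reach storage; if the device is still failing, df_recover() fails too and the
caller has to stop.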

But even if we convince the kernel folks of that, how do other platforms
behave? And how long before these kernels are out of use? We'd better deal
with it, crazy or no.

Please see my StackOverflow post for the kernel-level explanation. Note
also the test case link there. https://stackoverflow.com/a/42436054/398670

> > Retrying fsync() is not OK at
> > least on Linux. When fsync() returns success it means "all writes since the
> > last fsync have hit disk" but we assume it means "all writes since the last
> > SUCCESSFUL fsync have hit disk".
>
> If that's actually the case, we need to push back on this kernel brain
> damage, because as you're describing it fsync would be completely useless.
>

It's not useless, it's just telling us something other than what we think
it means. The promise it seems to give us is that if it reports an error
once, everything *after* that is useless, so we should throw our toys,
close and reopen everything, and redo from the last known-good state.
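
As a minimal sketch of the difference (plain C, not PostgreSQL source; the
function names fsync_with_retry, fsync_once and fsync_or_abort are made up
for illustration, and abort() merely stands in for a PANIC): the first
function is the retry pattern I'm arguing is unsafe on Linux, the other two
treat the first failure as final.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* UNSAFE pattern: keep calling fsync() until it reports success.
 * If the kernel reports a write-back error only once, a later success
 * says nothing about the pages that were lost by the failed attempt. */
int fsync_with_retry(int fd, int max_tries)
{
    assert(max_tries > 0);
    for (int i = 0; i < max_tries; i++) {
        if (fsync(fd) == 0)
            return 0;            /* possibly a false "all clear" */
    }
    return -1;
}

/* Fail-stop: call fsync() exactly once and hand back its errno
 * (0 on success). The caller must not retry on the same descriptor. */
int fsync_once(int fd)
{
    if (fsync(fd) == 0)
        return 0;
    return errno;
}

/* Hypothetical stand-in for PANIC: any fsync() failure ends the process,
 * so recovery restarts from the last known-good state. */
void fsync_or_abort(int fd)
{
    int err = fsync_once(fd);
    if (err != 0) {
        fprintf(stderr, "PANIC: fsync failed: %s\n", strerror(err));
        abort();
    }
}
```

A caller would use fsync_or_abort(fd) wherever it currently loops on fsync(),
and rely on crash recovery to redo the work.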

Though as Tomas posted below, it provides rather weaker guarantees than I
thought in some other areas too. See that lwn.net article he linked.

> Moreover, POSIX is entirely clear that successful fsync means all
> preceding writes for the file have been completed, full stop, doesn't
> matter when they were issued.
>

I can't find anything that says so to me. Please quote the relevant spec.

I'm working from
http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html which
states that

"The fsync() function shall request that all data for the open file
descriptor named by fildes is to be transferred to the storage device
associated with the file described by fildes. The nature of the transfer is
implementation-defined. The fsync() function shall not return until the
system has completed that action or until an error is detected."

My reading is that POSIX does not specify what happens AFTER an error is
detected. It doesn't say that error has to be persistent and that
subsequent calls must also report the error. It also says:

"If the fsync() function fails, outstanding I/O operations are not
guaranteed to have been completed."

but that doesn't clarify matters much either, because it can be read to
mean that once there's been an error reported for some IO operations
there's no guarantee those operations are ever completed even after a
subsequent fsync returns success.

I'm not seeking to defend what the kernel seems to be doing. Rather, I'm
saying that we might see similar behaviour on other platforms, crazy or not.
I haven't looked past Linux yet, though.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
