Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Anthony Iliopoulos <ailiop(at)altatus(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, ailiop(at)altatus(dot)com
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-03 13:36:47
Message-ID: 20180403133647.GQ11627@technoir
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

On Tue, Apr 03, 2018 at 12:26:05PM +0100, Greg Stark wrote:
> On 3 April 2018 at 11:35, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:
> > Hi Robert,
> >
> > Fully agree, and the errseq_t fixes have dealt exactly with the issue
> > of making sure that the error is reported to all file descriptors that
> > *happen to be open at the time of error*. But I think one would have a
> > hard time defending a modification to the kernel where this is further
> > extended to cover cases where:
> >
> > process A does write() on some file offset which fails writeback,
> > fsync() gets EIO and exit()s.
> >
> > process B does write() on some other offset which succeeds writeback,
> > but fsync() gets EIO due to (uncleared) failures of earlier process.
>
>
> Surely that's exactly what process B would want? If it calls fsync and
> gets a success and later finds out that the file is corrupt and didn't
> match what was in memory it's not going to be happy.

You can't possibly make this assumption. Process B may be reading
and writing to completely disjoint regions from those of process A,
and as such not really caring about earlier failures, only wanting
to ensure its own writes go all the way through. But even if it did
care, the file interfaces make no transactional guarantees. Even
without fsync() there is nothing preventing process B from reading
dirty pages from process A, and based on their content proceed to
to its own business and write/persist new data, while process A
further modifies the not-yet-flushed pages in-memory before flushing.
In this case you'd need explicit synchronization/locking between
the processes anyway, so why would fsync() be an exception?

> This seems like an attempt to co-opt fsync for a new and different
> purpose for which it's poorly designed. It's not an async error
> reporting mechanism for writes. It would be useless as that as any
> process could come along and open your file and eat the errors for
> writes you performed. An async error reporting mechanism would have to
> document which writes it was giving errors for and give you ways to
> control that.

The errseq_t fixes deal with that; errors will be reported to any
process that has an open fd, irrespective to who is the actual caller
of the fsync() that may have induced errors. This is anyway required
as the kernel may evict dirty pages on its own by doing writeback and
as such there needs to be a way to report errors on all open fds.

> The semantics described here are useless for everyone. For a program
> needing to know the error status of the writes it executed, it doesn't
> know which writes are included in which fsync call. For a program

If EIO persists between invocations until explicitly cleared, a process
cannot possibly make any decision as to if it should clear the error
and proceed or some other process will need to leverage that without
coordination, or which writes actually failed for that matter.
We would be back to the case of requiring explicit synchronization
between processes that care about this, in which case the processes
may as well synchronize over calling fsync() in the first place.

Having an opt-in persisting EIO per-fd would practically be a form
of "contract" between "cooperating" processes anyway.

But instead of deconstructing and debating the semantics of the
current mechanism, why not come up with the ideal desired form of
error reporting/tracking granularity etc., and see how this may be
fitted into kernels as a new interface.

Best regards,
Anthony

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Claudio Freire 2018-04-03 13:37:38 Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently
Previous Message Ashutosh Bapat 2018-04-03 13:34:45 Re: [HACKERS] advanced partition matching algorithm for partition-wise join