Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Anthony Iliopoulos <ailiop(at)altatus(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, ailiop(at)altatus(dot)com
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-03 16:52:07
Message-ID: 20180403165207.GR11627@technoir
Lists: pgsql-hackers

On Tue, Apr 03, 2018 at 03:37:30PM +0100, Greg Stark wrote:
> On 3 April 2018 at 14:36, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:
>
> > If EIO persists between invocations until explicitly cleared, a process
> > cannot possibly make any decision as to if it should clear the error
>
> I still don't understand what "clear the error" means here. The writes
> still haven't been written out. We don't care about tracking errors,
> we just care whether all the writes to the file have been flushed to
> disk. By "clear the error" you mean throw away the dirty pages and
> revert part of the file to some old data? Why would anyone ever want
> that?

It means that the responsibility for recovering the data is passed
back to the application. The writes may never be able to be written
out. How would a kernel deal with that? It can either discard the data
(and have the writer acknowledge that) or buffer the data until reboot
and simply risk going OOM. It's not something anyone would want, but
rather something one *needs* to deal with, one way or the other. At
least at the application level there's a fighting chance of restoring
a consistent state. The kernel does not have that opportunity.
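
To make the "fighting chance" concrete, here is a minimal sketch of what
application-level recovery could look like. It is only an illustration
under stated assumptions, not how Pg or any particular kernel behaves:
the names write_and_sync and persist_with_recovery, the injectable fsync
hook, and the policy of rewriting to a fallback path are all hypothetical.
The one point it relies on is that the application still holds its own
copy of the data when fsync() reports an error.

```python
import os


def write_and_sync(path, data, fsync=os.fsync):
    """Write data to path; return True only if fsync() reported success.

    The fsync parameter is a hypothetical hook so that a failure can be
    simulated; by default the real os.fsync() is used.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        while view:  # os.write() may write fewer bytes than requested
            written = os.write(fd, view)
            view = view[written:]
        try:
            fsync(fd)
        except OSError as exc:
            # Assumption: after a failed fsync() the kernel may already have
            # dropped the dirty pages, so a later fsync() on this file is
            # not treated as proof that the data reached disk.
            print(f"fsync failed for {path!r}: {exc}")
            return False
        return True
    finally:
        os.close(fd)


def persist_with_recovery(primary, fallback, data, fsync=os.fsync):
    """Return the path that durably holds data, or raise if none does."""
    for path in (primary, fallback):
        # The application still owns `data`, so every attempt rewrites it
        # from scratch instead of relying on pages left in the page cache.
        if write_and_sync(path, data, fsync=fsync):
            return path
    raise OSError("data could not be made durable at any location")
```

A database would more plausibly recover by replaying its own log than by
switching files; the fallback path above merely stands in for "some
recovery action that starts from data the application still has".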

> > But instead of deconstructing and debating the semantics of the
> > current mechanism, why not come up with the ideal desired form of
> > error reporting/tracking granularity etc., and see how this may be
> > fitted into kernels as a new interface.
>
> Because Postgres is portable software that won't be able to use some
> Linux-specific interface. And doesn't really need any granular error

I don't really follow this argument: Pg is admittedly already using
non-portable interfaces (e.g. sync_file_range()). While it's nice to
avoid platform-specific hacks, expecting POSIX semantics to be
consistent across systems is simply a '90s pipe dream. While it would
be lovely to have really consistent interfaces for application
writers, this is simply not going to happen any time soon.

And since those problematic fsync() semantics appear to be prevalent
in other systems as well, and are not likely to change, you cannot
rely on the preconception that once buffers are handed over to the
kernel they are guaranteed to be persisted eventually, no matter what.
(Why even bother having fsync() in that case? The kernel would eventually
evict and write back dirty pages anyway. The point of reporting the error
back to the application is to give it a chance to recover; the kernel
could repeat the "fsync()" itself internally if that would solve anything.)

> reporting system anyways. It just needs to know when all writes have
> been synced to disk.

Well, it does know when *some* writes have *not* been synced to disk,
exactly because the responsibility is passed back to the application.
I do realize this puts more burden back on the application, but what
would a viable alternative be? Would you rather have a kernel that
risks periodically going OOM due to this design decision?

Best regards,
Anthony
