Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Anthony Iliopoulos <ailiop(at)altatus(dot)com>, Geoff Winkless <pgsqladmin(at)geoff(dot)dj>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-18 11:46:15
Message-ID: 20180418114615.GB20040@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Apr 18, 2018 at 06:04:30PM +0800, Craig Ringer wrote:
> On 18 April 2018 at 05:19, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> > On Tue, Apr 10, 2018 at 05:54:40PM +0100, Greg Stark wrote:
> >> On 10 April 2018 at 02:59, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
> >>
> >> > Nitpick: In most cases the kernel reserves disk space immediately,
> >> > before returning from write(). NFS seems to be the main exception
> >> > here.
> >>
> >> I'm kind of puzzled by this. Surely NFS servers store the data in the
> >> filesystem using write(2) or the in-kernel equivalent? So if the
> >> server is backed by a filesystem where write(2) preallocates space
> >> surely the NFS server must behave as if it'spreallocating as well? I
> >> would expect NFS to provide basically the same set of possible
> >> failures as the underlying filesystem (as long as you don't enable
> >> nosync of course).
> >
> > I don't think the write is _sent_ to the NFS at the time of the write,
> > so while the NFS side would reserve the space, it might get the write
> > request until after we return write success to the process.
>
> It should be sent if you're using sync mode.
>
> >From my reading of the docs, if you're using async mode you're already
> open to so many potential corruptions you might as well not bother.
>
> I need to look into this more re NFS and expand the tests I have to
> cover that properly.

So, if sync mode passes the write to NFS, and NFS pre-reserves write
space, and throws an error on reservation failure, that means that NFS
will not corrupt a cluster on out-of-space errors.

So, what about thin provisioning? I can understand sharing _free_ space
among file systems, but once a write arrives I assume it reserves the
space. Is the problem that many thin provisioning systems don't have a
sync mode, so you can't force the write to appear on the device before
an fsync?

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Konstantin Knizhnik 2018-04-18 11:52:39 Re: Built-in connection pooling
Previous Message Arthur Zakirov 2018-04-18 11:37:11 Re: [HACKERS] proposal: schema variables