Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Anthony Iliopoulos <ailiop(at)altatus(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, Bruce Momjian <bruce(at)momjian(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-10 15:28:07
Message-ID: CA+TgmoapNxRAvfnOFD4FPM10wT90gjfgGZhDHW4kn=mOEVkuFQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Apr 10, 2018 at 1:37 AM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
> ... but *only if they hit an I/O error* or they're on a FS that
> doesn't reserve space and hit ENOSPC.
>
> It still does 99% of the job. It still flushes all buffers to
> persistent storage and maintains write ordering. It may not detect and
> report failures to the user how we'd expect it to, yes, and that's not
> great. But it's hardly throw up our hands and give up territory
> either. Also, at least for initdb, we can make initdb fsync() its own
> files before close(). Annoying but hardly the end of the world.

I think we'd need every child postgres process started by initdb to do
that individually, which I suspect would slow down initdb quite a lot.
Now admittedly for anybody other than a PostgreSQL developer that's
only a minor issue, and our regression tests mostly run with fsync=off
anyway. But I have a strong suspicion that our assumptions about how
fsync() reports errors are baked into an awful lot of parts of the
system, and by the time we get unbaking them I think it's going to be
really surprising if we haven't done real harm to overall system
performance.

BTW, I took a look at the MariaDB source code to see whether they've
got this problem too and it sure looks like they do.
os_file_fsync_posix() retries the fsync in a loop with an 0.2 second
sleep after each retry. It warns after 100 failures and fails an
assertion after 1000 failures. It is hard to understand why they
would have written the code this way unless they expect errors
reported by fsync() to continue being reported until the underlying
condition is corrected. But, it looks like they wouldn't have the
problem that we do with trying to reopen files to fsync() them later
-- I spot checked a few places where this code is invoked and in all
of those it looks like the file is already expected to be open.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeremy Finzel 2018-04-10 15:32:00 Including SQL files in extension scripts
Previous Message Tom Lane 2018-04-10 15:27:47 Re: crash with sql language partition support function