Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Catalin Iacob <iacobcatalin(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-03 01:29:28
Message-ID: CAEepm=2KSqu-fj8gEbLSE=uNcWWWpZ4bcjFtqYTGSCp0Lr_cSw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Apr 3, 2018 at 3:03 AM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
> I see little benefit to not just PANICing unconditionally on EIO, really. It
> shouldn't happen, and if it does, we want to be pretty conservative and
> adopt a data-protective approach.
>
> I'm rather more worried by doing it on ENOSPC. Which looks like it might be
> necessary from what I recall finding in my test case + kernel code reading.
> I really don't want to respond to a possibly-transient ENOSPC by PANICing
> the whole server unnecessarily.

Yeah, it'd be nice to give an administrator the chance to free up some
disk space after ENOSPC is reported, and stay up. Running out of
space really shouldn't take down the database without warning! The
question is whether the data remains in the kernel's cache and marked
dirty, so that retrying is a safe option. By that point it may
already be gone from our own buffers, so if the OS doesn't have it
either, the only place your committed data can definitely still be
found is the WAL, and then it's recovery time. Who can tell us? Do
we need a per-filesystem answer? Delayed allocation is a somewhat
filesystem-specific thing, so maybe.

Interestingly, there don't seem to be many operating systems that
document ENOSPC as a possible error from fsync(), based on a quick
scan through some documentation:

POSIX, AIX, HP-UX, FreeBSD, OpenBSD, NetBSD: no
Illumos/Solaris, Linux, macOS: yes

I don't know if macOS really means it or not; it just tells you to see
the errors for read(2) and write(2). By the way, speaking of macOS, I
was curious to see if the common BSD heritage would show here. Yeah,
somewhat. It doesn't appear to keep buffers on writeback error, if
this is the right code[1] (though it could be handling it somewhere
else for all I know).

[1] https://github.com/apple/darwin-xnu/blob/master/bsd/vfs/vfs_bio.c#L2695
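
To make the EIO vs. ENOSPC question concrete, here is a minimal sketch
in C of the kind of policy being discussed. It is not PostgreSQL code:
the names sync_action, classify_fsync_errno, sync_file and the
retry_is_safe flag are all made up for illustration. The flag stands
in for the unanswered question above, namely whether a given
OS/filesystem keeps the failed pages cached and dirty after fsync()
reports an error.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical policy names; none of these identifiers exist in PostgreSQL. */
typedef enum
{
    SYNC_OK,            /* fsync() succeeded */
    SYNC_RETRY_LATER,   /* stay up, let the admin free space, try again */
    SYNC_PANIC          /* crash and rebuild state from the WAL */
} sync_action;

/*
 * Decide what to do given the errno value left by fsync() (0 = success).
 * retry_is_safe is an assumption supplied by the caller: nonzero only if
 * the platform is known to keep the unwritten data cached and dirty.
 */
sync_action
classify_fsync_errno(int err, int retry_is_safe)
{
    assert(err >= 0);   /* errno values are never negative */

    if (err == 0)
        return SYNC_OK;
    if (err == ENOSPC && retry_is_safe)
        return SYNC_RETRY_LATER;

    /*
     * EIO, ENOSPC where the dirty data may already have been dropped,
     * and anything else: be conservative and data-protective.
     */
    return SYNC_PANIC;
}

/* Call fsync() on fd and map the outcome onto the policy above. */
sync_action
sync_file(int fd, int retry_is_safe)
{
    int err = 0;

    if (fsync(fd) != 0)
    {
        err = errno;
        fprintf(stderr, "fsync failed: %s\n", strerror(err));
    }
    return classify_fsync_errno(err, retry_is_safe);
}
```

The point of the sketch is only that the ENOSPC branch depends on a
fact we don't have yet. With retry_is_safe set to 0 it degenerates
into "PANIC on any fsync() failure".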

--
Thomas Munro
http://www.enterprisedb.com
