From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-03-29 12:07:56
Message-ID: CAEepm=06m88JtB6cefTKe74W4+7s9wd=9+wHg6E8P2R4gfKfgw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 29, 2018 at 6:58 PM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
> On 28 March 2018 at 11:53, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>
>> Craig Ringer <craig(at)2ndquadrant(dot)com> writes:
>> > TL;DR: Pg should PANIC on fsync() EIO return.
>>
>> Surely you jest.
>
> No. I'm quite serious. Worse, we quite possibly have to do it for ENOSPC as
> well to avoid similar lost-page-write issues.

I found your discussion with kernel hacker Jeff Layton at
https://lwn.net/Articles/718734/ in which he said: "The stackoverflow
writeup seems to want a scheme where pages stay dirty after a
writeback failure so that we can try to fsync them again. Note that
that has never been the case in Linux after hard writeback failures,
AFAIK, so programs should definitely not assume that behavior."

The LWN article above that quote says the same thing a couple of
different ways, i.e. that a writeback failure leaves you with pages
that are neither successfully written to disk nor still marked dirty.

If I'm reading various articles correctly, the situation was even
worse before his errseq_t work landed. That fixed cases of completely
unreported writeback failures, caused by PG_error being shared between
writeback and read errors on certain filesystems, but it doesn't
address the clean-pages problem.

Yeah, I see why you want to PANIC.
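
To make the hazard concrete, here is roughly the sequence I have in
mind, as a hypothetical test sketch (not something from this thread or
from our tree). It assumes /mnt/faulty is a filesystem on a device set
up to fail writes, e.g. a dm-error or dm-flakey target; the path and
setup are made up for illustration.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    char buf[8192];
    int  fd = open("/mnt/faulty/testfile", O_RDWR | O_CREAT, 0600);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    memset(buf, 'x', sizeof(buf));
    if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
        perror("write");

    /*
     * Depending on when writeback runs, the EIO may surface here or
     * only on the second call.
     */
    if (fsync(fd) < 0)
        fprintf(stderr, "first fsync: %s\n", strerror(errno));

    /*
     * With the Linux behaviour described above, the failed pages have
     * already been marked clean, so once the error has been reported a
     * further fsync() can return 0 even though the data never reached
     * the storage device.
     */
    if (fsync(fd) == 0)
        fprintf(stderr, "second fsync: success, but the data may be gone\n");

    close(fd);
    return 0;
}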

>> Moreover, POSIX is entirely clear that successful fsync means all
>> preceding writes for the file have been completed, full stop, doesn't
>> matter when they were issued.
>
> I can't find anything that says so to me. Please quote relevant spec.
>
> I'm working from
> http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html which
> states that
>
> "The fsync() function shall request that all data for the open file
> descriptor named by fildes is to be transferred to the storage device
> associated with the file described by fildes. The nature of the transfer is
> implementation-defined. The fsync() function shall not return until the
> system has completed that action or until an error is detected."
>
> My reading is that POSIX does not specify what happens AFTER an error is
> detected. It doesn't say the error has to be persistent or that subsequent
> calls must also report the error.

FWIW my reading is the same as Tom's. It says "all data for the open
file descriptor" without qualification or special treatment after
errors. Not "some".
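
Put differently, a caller trusting that wording could reasonably wrap
fsync() like this (a hypothetical sketch, not our actual checkpointer
code) and treat a successful return as covering every write issued
through the descriptor so far:

#include <errno.h>
#include <unistd.h>

int
fsync_until_success(int fd)
{
    for (;;)
    {
        if (fsync(fd) == 0)
            return 0;        /* POSIX reading: all prior writes durable */
        if (errno == EINTR)
            continue;

        /*
         * Given the clean-pages-after-failure behaviour described
         * upthread, retrying here is not safe: a later success no
         * longer covers the writes that failed, so the only defensible
         * reaction is to treat this as fatal.
         */
        return -1;
    }
}

Under the Linux behaviour above, the "all data" guarantee quietly turns
into "all data the kernel still considers dirty", which is exactly the
gap Craig is worried about.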

> I'm not seeking to defend what the kernel seems to be doing. Rather, saying
> that we might see similar behaviour on other platforms, crazy or not. I
> haven't looked past linux yet, though.

Without strong evidence, I see no reason to think that any other
operating system behaves that way... This is openly acknowledged to be
"a mess" and "a surprise" in the Filesystem Summit article. I'm not
really qualified to comment, but from a cursory glance at FreeBSD's
vfs_bio.c I think it does what you'd hope for... see the code near
the comment "Failed write, redirty."

--
Thomas Munro
http://www.enterprisedb.com
