Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Anthony Iliopoulos <ailiop(at)altatus(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-04 05:29:28
Message-ID: CAEepm=0Wx4koMzmouvxanr_Ew0e5uk-JGHgSF=rkT9Hs6Mi2_Q@mail.gmail.com
Lists: pgsql-hackers

On Wed, Apr 4, 2018 at 2:44 PM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> On Wed, Apr 4, 2018 at 2:14 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>> Uh, are you sure it fixes our use-case? From the email description it
>> sounded like it only reported fsync errors for every open file
>> descriptor at the time of the failure, but the checkpoint process might
>> open the file _after_ the failure and try to fsync a write that happened
>> _before_ the failure.
>
> I'm not sure of anything. I can see that it's designed to report
> errors since the last fsync() of the *file* (presumably via any fd),
> which sounds like the desired behaviour:
>
> [..]

Scratch that. Whenever you open a file descriptor you can't see any
preceding errors at all, because:

/* Ensure that we skip any errors that predate opening of the file */
f->f_wb_err = filemap_sample_wb_err(f->f_mapping);

https://github.com/torvalds/linux/blob/master/fs/open.c#L752

Our whole design is based on being able to open, close and reopen
files at will from any process, and in particular to fsync() from a
different process that didn't inherit the fd but instead opened it
later. But it looks like the kernel behaviour quoted above might eat
errors that occurred during asynchronous writeback (when there was
nobody to report them to), before you opened the file?
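
To make the concern concrete, here is a minimal toy model in C of the
sampling behaviour described above. It is a sketch under stated
assumptions, not the kernel's actual errseq_t implementation: the
struct and function names (model_mapping, model_file, model_open,
model_fsync, model_writeback_error) are hypothetical, and the only
thing modelled is "sample the error sequence at open, compare it at
fsync".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-file (per-inode) writeback error state. */
struct model_mapping {
    unsigned int wb_err_seq;   /* bumped each time a writeback error is recorded */
    int          wb_err_code;  /* most recent error code, e.g. EIO */
};

/* Hypothetical stand-in for one open file description. */
struct model_file {
    struct model_mapping *mapping;
    unsigned int          seen_seq;  /* value sampled at open or at last fsync */
};

/* Asynchronous writeback fails: nobody is waiting, so the error is only recorded. */
void model_writeback_error(struct model_mapping *m, int err)
{
    assert(m != NULL);
    m->wb_err_seq++;
    m->wb_err_code = err;
}

/* open(): sample the current sequence, so errors that predate the open are skipped. */
struct model_file model_open(struct model_mapping *m)
{
    assert(m != NULL);
    struct model_file f = { m, m->wb_err_seq };
    return f;
}

/* fsync(): report an error only if one was recorded after this file's last sample. */
int model_fsync(struct model_file *f)
{
    assert(f != NULL);
    if (f->seen_seq == f->mapping->wb_err_seq)
        return 0;                          /* nothing new since open/last fsync */
    f->seen_seq = f->mapping->wb_err_seq;  /* mark the error as reported to this fd */
    return -f->mapping->wb_err_code;
}
```

In this model, a file opened before model_writeback_error() gets the
error from its next model_fsync(), while a file opened afterwards
(the checkpointer case) gets 0 even though the failed write predates
its fsync.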

If so I'm not sure how that can possibly be considered to be an
implementation of _POSIX_SYNCHRONIZED_IO: "the fsync() function shall
force all currently queued I/O operations associated with the file
indicated by file descriptor fildes to the synchronized I/O completion
state." Note "the file", not "this file descriptor + copies", and
without reference to when you opened it.

> But I'm not sure what the lifetime of the passed-in "file" and more
> importantly "file->f_wb_err" is.

It's really inode->i_mapping->wb_err's lifetime that I should have
been asking about there, not file->f_wb_err, but I see now that that
question is irrelevant due to the above.

--
Thomas Munro
http://www.enterprisedb.com
