From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Anthony Iliopoulos <ailiop(at)altatus(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-04 14:00:15
Message-ID: CAMsr+YErjSKktTQgG0EF9FusDLnvsE32knFWrzFkKSxHyPy9gw@mail.gmail.com
Lists: pgsql-hackers

On 4 April 2018 at 15:51, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:

> On 4 April 2018 at 14:00, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
>
>> On 4 April 2018 at 13:29, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
>> wrote:
>>
>>> On Wed, Apr 4, 2018 at 2:44 PM, Thomas Munro
>>> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>>> > On Wed, Apr 4, 2018 at 2:14 PM, Bruce Momjian <bruce(at)momjian(dot)us>
>>> wrote:
>>> >> Uh, are you sure it fixes our use-case? From the email description it
>>> >> sounded like it only reported fsync errors for every open file
>>> >> descriptor at the time of the failure, but the checkpoint process
>>> might
>>> >> open the file _after_ the failure and try to fsync a write that
>>> happened
>>> >> _before_ the failure.
>>> >
>>> > I'm not sure of anything. I can see that it's designed to report
>>> > errors since the last fsync() of the *file* (presumably via any fd),
>>> > which sounds like the desired behaviour:
>>> >
>>> > [..]
>>>
>>> Scratch that. Whenever you open a file descriptor you can't see any
>>> preceding errors at all, because:
>>>
>>> /* Ensure that we skip any errors that predate opening of the file */
>>> f->f_wb_err = filemap_sample_wb_err(f->f_mapping);
>>>
>>> https://github.com/torvalds/linux/blob/master/fs/open.c#L752
>>>
>>> Our whole design is based on being able to open, close and reopen
>>> files at will from any process, and in particular to fsync() from a
>>> different process that didn't inherit the fd but instead opened it
>>> later. But it looks like that might be able to eat errors that
>>> occurred during asynchronous writeback (when there was nobody to
>>> report them to), before you opened the file?
>>>
>>
>> Holy hell. So even PANICing on fsync() isn't sufficient, because the
>> kernel will deliberately hide writeback errors that predate our fsync()
>> call from us?
>>
>> I'll see if I can expand my testcase for that. I'm presently dockerizing
>> it to make it easier for others to use, but that turns out to be a major
>> pain when using devmapper etc. Docker in privileged mode doesn't seem to
>> play nice with device-mapper.
>>
>>
> Done, you can find it in
> https://github.com/ringerc/scrapcode/tree/master/testcases/fsync-error-clear
> now.
>
>
Update. Now supports multiple FSes.

I've tried xfs, jfs, ext3, ext4, even vfat. All behave the same on EIO.
Haven't tried zfs-on-linux or other platforms yet.

Still working on getting ENOSPC on fsync() rather than write(). Reading the
kernel code suggests this is possible, but all the above FSes reserve space
eagerly on write() even if they do delayed allocation of the actual
storage, so it doesn't seem to happen, at least in my simple single-process
test.
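
For reference, here's a minimal sketch of that single-process pattern (the
mount point is a made-up example; the real test case lives in the repo
above):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[8192];
        int  fd;

        memset(buf, 'x', sizeof(buf));

        /* a small, nearly-full test filesystem */
        fd = open("/mnt/enospc-test/file", O_CREAT | O_WRONLY, 0644);
        if (fd < 0)
        {
            perror("open");
            return 1;
        }

        /* keep writing until the FS runs out of space */
        for (;;)
        {
            if (write(fd, buf, sizeof(buf)) < 0)
            {
                /* every FS tried so far reports ENOSPC here ... */
                perror("write");
                break;
            }
        }

        /* ... rather than deferring it to here */
        if (fsync(fd) < 0)
            perror("fsync");

        close(fd);
        return 0;
    }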

I'm not overly inclined to complain about an fsync() succeeding after a
write() error. That seems reasonable enough: the kernel told the app at the
time of the failure. What else is it going to do? I personally don't even
object hugely to the current fsync() behaviour if it were, say, DOCUMENTED
and conformant to the relevant standards, though not giving us any sane way
to find out the affected file ranges makes it drastically harder to recover
sensibly.

But what's come out since on this thread, that we cannot even rely on
fsync() giving us an EIO *once* when it loses our data, because:

- all currently widely deployed kernels can fail to deliver the error due
to a recently fixed limitation; and
- the kernel deliberately hides errors from us if they relate to writes
that occurred before we opened the FD (?)

... that's really troubling. I thought we could at least fix this by
PANICing on EIO, and was mostly worried about ENOSPC. But now it seems we
can't even do that and expect reliability. So what the @#$ are we meant to
do?

It's the error reporting issues around closing and reopening files with
outstanding buffered I/O that are really going to hurt us here. I'll be
expanding my test case to cover that shortly.
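
To illustrate the pattern that test needs to reproduce - this is a sketch
of mine, not something in the repo yet - one process dirties a file and
closes it, kernel writeback hits an I/O error while nobody holds the file
open, then a second open + fsync() reports success, because per the
fs/open.c snippet above the new fd samples the error sequence at open time
and skips anything earlier:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd;

        /* "backend": dirty the page cache, then close without fsync() */
        fd = open("testfile", O_CREAT | O_WRONLY, 0644);
        if (fd < 0 || write(fd, "hello", 5) != 5)
            perror("open/write");
        close(fd);

        /*
         * Induce an I/O error here (e.g. dm-error under the FS) and
         * give background writeback time to hit it while no process
         * has the file open.
         */
        sleep(30);

        /* "checkpointer": reopen and fsync() from a fresh fd */
        fd = open("testfile", O_WRONLY);
        if (fsync(fd) == 0)
            printf("fsync() reported success despite lost writes\n");
        close(fd);
        return 0;
    }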

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
