Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-05 08:46:08
Message-ID: CAMsr+YETSXaZ-kVMekgmsZFL2X7A7198ghHN4JFoHG-k3TvT2Q@mail.gmail.com
Lists: pgsql-hackers

On 5 April 2018 at 15:09, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:

> Also, it's been reported to me off-list that anyone on the system calling
> sync(2) or the sync shell command will also generally consume the write
> error, causing us not to see it when we fsync(). The same is true
> for /proc/sys/vm/drop_caches. I have not tested these yet.
>

I just confirmed this with a tweak to the test that:

- records the file position
- close()s the fd
- sync()s
- open()s the file
- lseek()s back to the recorded position

This causes the test to completely ignore the I/O error, which is not
reported to it at any time.
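
For reference, here's a minimal sketch (not the actual test program) of that
sequence of calls. It assumes "testfile" sits on a device that has been set up
to fail writeback (e.g. a device-mapper error target), which is outside the
scope of the sketch; the path and buffer size are just illustrative.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    const char *path = "testfile";   /* assumed to be on the failing device */
    char        buf[8192];
    off_t       pos;
    int         fd;

    memset(buf, 'x', sizeof(buf));

    fd = open(path, O_CREAT | O_RDWR, 0644);
    if (fd < 0 || write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
    {
        perror("open/write");
        exit(1);
    }

    pos = lseek(fd, 0, SEEK_CUR);    /* record the file position */
    close(fd);                       /* close the fd */
    sync();                          /* writeback fails here; the kernel
                                      * consumes the error */

    fd = open(path, O_RDWR);         /* re-open the file */
    lseek(fd, pos, SEEK_SET);        /* seek back to the recorded position */

    /*
     * fsync() on the new fd reports success: the I/O error from the earlier
     * write() is never reported to this process.
     */
    if (fsync(fd) == 0)
        printf("fsync reported success despite the lost write\n");
    else
        perror("fsync");

    close(fd);
    return 0;
}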

Fair enough, really, when you look at it from the kernel's point of view.
What else can it do? Nobody has the file open. It'd have to mark the file
itself as bad somehow. But that's pretty bad for our robustness AFAICS.

> There's some level of agreement that we should PANIC on fsync() errors, at
> least on Linux, but likely everywhere. But we also now know it's
> insufficient to be fully protective.
>

If dirty writeback fails between our close() and re-open(), I see the same
behaviour as with sync(). To test that I set dirty_writeback_centisecs
and dirty_expire_centisecs to 1 and added a usleep(3*100*1000) between
close() and open(). (It's still plenty slow.) So sync() is a convenient way
to simulate something other than our own fsync() writing out the dirty
buffer.
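
Sketched out, that variant differs from the one above only in what happens
between close() and open(). Again this is illustrative rather than the actual
test: it assumes the tunables were lowered beforehand (e.g. with
sysctl -w vm.dirty_writeback_centisecs=1 and
sysctl -w vm.dirty_expire_centisecs=1) and that "testfile" is on the failing
device.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    char    buf[8192];
    off_t   pos;
    int     fd;

    memset(buf, 'x', sizeof(buf));

    fd = open("testfile", O_CREAT | O_RDWR, 0644);
    if (fd < 0 || write(fd, buf, sizeof(buf)) < 0)
    {
        perror("open/write");
        return 1;
    }

    pos = lseek(fd, 0, SEEK_CUR);
    close(fd);

    usleep(3 * 100 * 1000);          /* no sync(); give the kernel's dirty
                                      * writeback time to try, and fail, to
                                      * flush our dirty page */

    fd = open("testfile", O_RDWR);
    lseek(fd, pos, SEEK_SET);

    if (fsync(fd) == 0)              /* still reports success */
        printf("fsync reported success despite the failed writeback\n");
    else
        perror("fsync");

    close(fd);
    return 0;
}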

If I omit the sync() then we get the error reported by fsync() once when we
re-open() the file and fsync() it, because the buffers weren't written out
yet, so the error wasn't generated until we re-open()ed the file. But I
doubt that'll happen much in practice, because dirty writeback will get to
it first, so the error will be seen and discarded before we re-open the file
in the checkpointer.

In other words, it looks like *even with a new kernel with the error
reporting bug fixes*, if I understand how the backends and checkpointer
interact when it comes to file descriptors, we're unlikely to notice I/O
errors and fail a checkpoint. We may notice I/O errors if a backend does
its own eager writeback for large I/O operations, or if the checkpointer
fsync()s a file before the kernel's dirty writeback gets around to trying
to flush the pages that will fail.

I haven't tested anything with multiple processes / multiple FDs yet, where
we keep one fd open while writing on another.

But at this point I don't see any way to make Pg reliably detect I/O errors
and fail a checkpoint then redo and retry. To even fix this by PANICing
like I proposed originally, we need to know we have to PANIC.

AFAICS it's completely unsafe to write(), close(), open() and fsync() and
expect that the fsync() makes any promises about the write(). Which, if I
read Pg's low-level storage code right, makes it completely unable to
reliably detect I/O errors.

When you put it that way, it sounds fair enough too. How long is the kernel
meant to remember that there was a write error on the file, triggered by a
write initiated by some seemingly unrelated process, some unbounded time
ago, on a since-closed file?

But it seems to put Pg on the fast track to O_DIRECT.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
