Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-05 07:09:57
Message-ID: CAMsr+YFNivjj1eYX0-=jfaAi8u+Q6CSOXN82_xuALzXAdpWe-Q@mail.gmail.com

Summary to date:

It's worse than I thought originally, because:

- Most widely deployed kernels have cases where they don't tell you about
losing your writes at all; and
- Information about loss of writes can be masked by closing and re-opening
a file

So the checkpointer cannot trust that a successful fsync() means ... a
successful fsync().

Also, it's been reported to me off-list that anyone on the system calling
sync(2) or the sync shell command will also generally consume the write
error, causing us not to see it when we fsync(). The same is true
for /proc/sys/vm/drop_caches. I have not tested these yet.
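
To make that concrete, here's a minimal standalone sketch of the pattern (not taken from my test tools; the file path and the dm "error" setup behind it are assumptions):

/*
 * Standalone sketch only: assumes /mnt/faulty/file sits on a device-mapper
 * "error" target (or similar) so that writeback of its dirty pages fails.
 * The path and setup are illustrative, not part of my test tools.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    char    buf[4096];
    int     fd = open("/mnt/faulty/file", O_WRONLY | O_CREAT, 0644);

    memset(buf, 'x', sizeof(buf));
    if (fd < 0 || pwrite(fd, buf, sizeof(buf), 0) != (ssize_t) sizeof(buf))
        return 1;           /* the buffered write itself usually succeeds */

    /*
     * If anything else calls sync(2) here (or the sync shell command, or
     * writes to /proc/sys/vm/drop_caches), the kernel may report the
     * writeback failure to that caller and clear the per-file error state.
     */

    close(fd);                               /* error can also be lost here */
    fd = open("/mnt/faulty/file", O_WRONLY); /* new fd, fresh error state   */

    if (fsync(fd) == 0)
        puts("fsync() reported success even though the write was lost");
    else
        printf("fsync() failed as it should: %s\n", strerror(errno));

    close(fd);
    return 0;
}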

There's some level of agreement that we should PANIC on fsync() errors, at
least on Linux but likely everywhere. We also now know that doing so is not,
by itself, fully protective.
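
The shape of that change is roughly as follows; this is only a sketch using
our error-reporting API, with a hypothetical helper, not an actual patch:

/*
 * Sketch only, not an actual patch: the function and its caller are
 * hypothetical.  The idea is that any fsync() failure in the checkpointer
 * is promoted to PANIC, so we recover from WAL instead of trusting a
 * retried fsync() that the kernel may report as successful.
 */
static void
fsync_or_panic(int fd, const char *path)
{
    if (pg_fsync(fd) < 0)
        ereport(PANIC,
                (errcode_for_file_access(),
                 errmsg("could not fsync file \"%s\": %m", path)));
}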

I previously thought that errors=remount-ro was a sufficient safeguard. It
isn't. There doesn't seem to be anything that is, for ext3, ext4, btrfs or
xfs.

It's not clear to me yet why data_err=abort isn't sufficient in
data=ordered or data=writeback mode on ext3 or ext4; that needs more digging.
(In my test tools that's:
make FSTYPE=ext4 MKFSOPTS="" MOUNTOPTS="errors=remount-ro,data_err=abort,data=journal"
as of the current version d7fe802ec.) AFAICS that's because
data_err=abort only affects data=ordered, not data=journal. If you use
data=ordered, you at least get repeated failures when the failed write is
retried. This post, https://lkml.org/lkml/2008/10/10/80, added the option and
has some explanation, but doesn't explain why it doesn't affect data=journal.

zfs is probably not affected by the issues, per Thomas Munro. I haven't run
my test scripts on it yet because my kernel doesn't have zfs support and
I'm prioritising the multi-process / open-and-close issues.

So far none of the FSes and options I've tried exhibit the behaviour I
actually want, which is to make the fs read-only or inaccessible on I/O
error.

ENOSPC doesn't seem to be a concern during normal operation of major file
systems (ext3, ext4, btrfs, xfs) because they reserve space before
returning from write(). But if a buffered write does manage to fail due to
ENOSPC, we'll definitely see the same problems. This makes ENOSPC on NFS a
potentially data-corrupting condition, since NFS doesn't preallocate space
before returning from write().
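
As a sketch of what that looks like from userspace, assuming an NFS mount
whose server-side export is already full (path and setup are assumptions):

/*
 * Sketch: write() can succeed because NFS doesn't reserve space, and the
 * ENOSPC only surfaces at fsync() time, where it is then subject to the
 * same error-reporting problems described above.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    char    buf[8192];
    int     fd = open("/mnt/nfs_full/file", O_WRONLY | O_CREAT, 0644);

    memset(buf, 'x', sizeof(buf));
    if (fd < 0)
        return 1;

    if (write(fd, buf, sizeof(buf)) == (ssize_t) sizeof(buf))
        puts("write() succeeded; no space was reserved on the server");

    if (fsync(fd) < 0 && errno == ENOSPC)
        puts("fsync() failed with ENOSPC");

    close(fd);
    return 0;
}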

I think what we really need is a block-layer fix, where an I/O error flips
the block device into read-only mode, as if blockdev --setro had
been used. Though I'd settle for a kernel panic, frankly. I don't think
anybody really wants that, but I'd take either of those over silent data
loss.
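
For reference, what blockdev --setro does is essentially the BLKROSET ioctl;
here's a minimal sketch (device path assumed, root required) of the state
I'd like the block layer to enter on its own after an I/O error:

/*
 * Sketch: flip a block device read-only from userspace, the same thing
 * "blockdev --setro" wraps.  The device path is an assumption.
 */
#include <fcntl.h>
#include <linux/fs.h>           /* BLKROSET */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int
main(void)
{
    int     ro = 1;
    int     fd = open("/dev/dm-0", O_RDONLY);

    if (fd < 0 || ioctl(fd, BLKROSET, &ro) < 0)
    {
        perror("BLKROSET");
        return 1;
    }

    close(fd);
    return 0;
}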

I'm currently tweaking my test to close and reopen the file between each
write() and fsync(), and to support running on NFS.

I've also just found the device-mapper "flakey" driver, which looks
fantastic for simulating unreliable I/O with intermittent faults. I've been
using the "error" target in a mapping, which lets me remap some of the
device to always error, but "flakey" looks very handy for actual PostgreSQL
testing.

For the sake of Google, these are errors known to be associated with the
problem:

ext4, and ext3 mounted with ext4 driver:

[42084.327345] EXT4-fs warning (device dm-0): ext4_end_bio:323: I/O error
10 writing to inode 12 (offset 0 size 0 starting block 59393)
[42084.327352] Buffer I/O error on device dm-0, logical block 59393

xfs:

[42193.771367] XFS (dm-0): writeback error on sector 118784
[42193.784477] XFS (dm-0): writeback error on sector 118784

jfs: (nil, silence in the kernel logs)

You should also beware of "lost page write" or "lost write" errors.
