PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-03-28 02:23:46
Message-ID: CAMsr+YHh+5Oq4xziwwoEfhoTZgr07vdGG+hu=1adXx59aTeaoQ@mail.gmail.com
Lists: pgsql-hackers

Hi all

Some time ago I ran into an issue where a user encountered data corruption
after a storage error. PostgreSQL played a part in that corruption by
allowing a checkpoint to complete after what should've been treated as a
fatal error.

TL;DR: Pg should PANIC on fsync() EIO return. Retrying fsync() is not OK at
least on Linux. When fsync() returns success it means "all writes since the
last fsync have hit disk" but we assume it means "all writes since the last
SUCCESSFUL fsync have hit disk".

Pg wrote some blocks, which went to OS dirty buffers for writeback.
Writeback failed due to an underlying storage error. The block I/O layer
and XFS marked the writeback page as failed (AS_EIO), but had no way to
tell the app about the failure. When Pg called fsync() on the FD during the
next checkpoint, fsync() returned EIO because of the flagged page, to tell
Pg that a previous async write failed. Pg treated the checkpoint as failed
and didn't advance the redo start position in the control file.

All good so far.

But then we retried the checkpoint, which retried the fsync(). The retry
succeeded, because the prior fsync() *cleared the AS_EIO bad page flag*.

The write never made it to disk, but we completed the checkpoint, and
merrily carried on our way. Whoops, data loss.
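To make the failure mode concrete, here's a minimal standalone sketch of the
retry pattern described above. It is not PostgreSQL source; checkpoint_fsync()
and the demo file name are made up for illustration, and the EIO itself is only
described in comments since you can't provoke it on a healthy disk:

/*
 * Standalone sketch of the fsync-retry pattern; not PostgreSQL code.
 * checkpoint_fsync() is a made-up helper standing in for the checkpointer's
 * fsync of a relation segment.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int
checkpoint_fsync(int fd)
{
    if (fsync(fd) != 0)
    {
        fprintf(stderr, "fsync failed: %s\n", strerror(errno));
        return -1;              /* checkpoint treated as failed */
    }
    return 0;                   /* checkpoint allowed to complete */
}

int
main(void)
{
    int fd = open("relation_segment.demo", O_RDWR | O_CREAT, 0600);

    if (fd < 0)
        return 1;

    /*
     * Suppose the kernel hit EIO while writing back this file's dirty
     * pages.  The first fsync() reports it and the checkpoint is aborted,
     * which is correct.
     */
    if (checkpoint_fsync(fd) != 0)
    {
        /*
         * On the retry, the failed fsync() has already cleared the AS_EIO
         * flag, so this call can return success even though the data never
         * reached disk.  The checkpoint "completes" and redo moves past
         * the lost write.
         */
        if (checkpoint_fsync(fd) == 0)
            fprintf(stderr, "retried checkpoint succeeded -- data silently lost\n");
    }

    close(fd);
    return 0;
}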

The clear-error-and-continue behaviour of fsync is not documented as far as
I can tell. Nor is fsync() returning EIO, unless you have a very new Linux
man-pages with the patch I wrote to add it. But from what I can see in the
POSIX standard we are not given any guarantees about what happens on
fsync() failure at all, so we're probably wrong to assume that retrying
fsync() is safe.

If the server had been using ext3 or ext4 with errors=remount-ro, the
problem wouldn't have occurred because the first I/O error would've
remounted the FS and stopped Pg from continuing. But XFS doesn't have that
option. There may be other situations where this can occur too, involving
LVM and/or multipath, but I haven't comprehensively dug out the details yet.

It proved possible to recover the system by faking up a backup label from
before the first incorrectly-successful checkpoint, forcing redo to repeat
and write the lost blocks. But ... what a mess.

I posted about the underlying fsync issue here some time ago:

https://stackoverflow.com/q/42434872/398670

but haven't had a chance to follow up about the Pg specifics.

I've been looking at the problem on and off and haven't come up with a good
answer. I think we should just PANIC and let redo sort it out by repeating
the failed write when it replays the work since the last checkpoint.

The API offered by async buffered writes and fsync offers us no way to find
out which page failed, so we can't just selectively redo that write. I
think we do know the relfilenode associated with the fd that failed to
fsync, but not much more. So the alternative seems to be some sort of
potentially complex online-redo scheme where we replay WAL only for the
relation on which we had the fsync() error, while otherwise servicing
queries normally. That's likely to be extremely error-prone and hard to
test, and it's trying to solve a case where on other filesystems the whole
DB would grind to a halt anyway.

I looked into whether we can solve it with use of the AIO API instead, but
the mess is even worse there - from what I can tell you can't even reliably
guarantee fsync at all on all Linux kernel versions.

We already PANIC on fsync() failure for WAL segments. We just need to do
the same for data forks at least for EIO. This isn't as bad as it seems
because AFAICS fsync only returns EIO in cases where we should be stopping
the world anyway, and many FSes will do that for us.

There are rather a lot of pg_fsync() callers. While we could handle this
case-by-case for each one, I'm tempted to just make pg_fsync() itself
intercept EIO and PANIC. Thoughts?
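Roughly what I have in mind, as a sketch only: the real thing would go through
ereport(PANIC, ...) inside pg_fsync() in fd.c rather than a bare abort(), and
panic_on_eio_fsync() is just a hypothetical name for this message:

/*
 * Sketch only; not the actual pg_fsync().  abort() marks where the backend
 * would stop the world instead of letting callers retry.
 */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static int
panic_on_eio_fsync(int fd)
{
    int rc = fsync(fd);

    if (rc != 0 && errno == EIO)
    {
        /*
         * A retry could falsely succeed because the kernel has already
         * dropped the failed pages, so don't give callers the chance:
         * crash, and let WAL redo rewrite everything since the last
         * completed checkpoint.
         */
        fprintf(stderr, "PANIC: fsync returned EIO: %s\n", strerror(errno));
        abort();
    }

    return rc;                  /* other failures still bubble up as errors */
}

int
main(void)
{
    /*
     * Exercise the wrapper on stdout's fd; a terminal won't produce EIO,
     * so this just demonstrates the call path.
     */
    if (panic_on_eio_fsync(STDOUT_FILENO) != 0)
        perror("fsync");
    return 0;
}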

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
