Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Christophe Pettus <xof(at)thebuild(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, Robert Haas <robertmhaas(at)gmail(dot)com>, Anthony Iliopoulos <ailiop(at)altatus(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-09 02:00:41
Message-ID: CAMsr+YFsrjzj8oisCcrTo3RB35D_kAmdd0VOOUQwqxtQw6LS_w@mail.gmail.com
Lists: pgsql-hackers

On 9 April 2018 at 07:16, Andres Freund <andres(at)anarazel(dot)de> wrote:

>
> I think the danger presented here is far smaller than some of the
> statements in this thread might make one think.

Clearly it's not happening a huge amount or we'd have a lot of noise about
Pg eating people's data, people shouting about how unreliable it is, etc.
We don't. So it's not some earth shattering imminent threat to everyone's
data. It's gone unnoticed, or the root cause unidentified, for a long time.

I suspect we've written off a fair few issues in the past as "it's bad
hardware" when actually the hardware fault was just the trigger for a
Pg/kernel interaction bug. And blamed containers for things that weren't
container's fault. But even so, if it were happening tons, we'd hear more
noise.

I was already very surprised when I learned that PostgreSQL completely
ignores wholly absent relfilenodes. Specifically, if you unlink() a
relation's backing relfilenode while Pg is down and that file has writes
pending in the WAL, we merrily re-create it with uninitialized pages and go
on our way. As Andres pointed out in an offlist discussion, redo isn't a
consistency check, and it's not obliged to fail in such cases.
We can say "well, don't do that then" and define away file losses from FS
corruption etc. as not our problem: the lower levels we expect to take care
of this have failed.

We have to look at what checkpoints are and are not supposed to promise,
and whether this is a problem we just define away as "not our problem, the
lower level failed, we're not obliged to detect this and fail gracefully."

We can choose to say that checkpoints are required to guarantee crash/power
loss safety ONLY and do not attempt to protect against I/O errors of any
sort. In fact, I think we should likely amend the documentation for release
versions to say just that.

> In all likelihood, once
> you've got an IO error that kernel level retries don't fix, your
> database is screwed.

Your database is going to be down or have interrupted service. It's
possible you may have some unreadable data. This could result in localised
damage to one or more relations. That could affect FK relationships,
indexes, all sorts. If you're really unlucky you might lose something
critical like pg_clog/ contents.

But in general your DB should be repairable/recoverable even in those cases.

And in many failure modes there's no reason to expect any data loss at all,
like:

* Local disk fills up (seems to be safe already due to space reservation at
write() time)
* Thin-provisioned storage backing local volume iSCSI or paravirt block
device fills up
* NFS volume fills up
* Multipath I/O error
* Interruption of connectivity to network block device
* Disk develops localized bad sector where we haven't previously written
data
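Most of the failure modes above surface, if at all, as errors from write()
or fsync(): the copy into the page cache usually succeeds even when the
device will later fail, so durability is only known at sync time. A minimal
sketch of checking both return values (hypothetical helper, plain POSIX,
not PostgreSQL code):

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: write a buffer and confirm durability.
 * write() can report ENOSPC up front (the space-reservation case),
 * but a writeback I/O error is typically only reported by fsync(),
 * so both return values must be checked. Returns 0 on success. */
static int
write_durably(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    ssize_t n = write(fd, buf, len);
    if (n < 0 || (size_t) n != len) {   /* e.g. ENOSPC lands here */
        close(fd);
        return -1;
    }

    if (fsync(fd) != 0) {               /* writeback errors land here */
        close(fd);
        return -1;
    }
    return close(fd);
}
```

The point of the sketch is only that a clean write() return tells you
nothing about what the kernel will later report at fsync() time.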

Except for the ENOSPC on NFS, all the rest of the cases can be handled by
expecting the kernel to retry forever and not return until the block is
written or we reach the heat death of the universe. And NFS, well...

Part of the trouble is that the kernel *won't* retry forever in all these
cases, and doesn't seem to have a way to ask it to in all cases.

And if the user hasn't configured it for the right behaviour in terms of
I/O error resilience, we don't find out about it.

So it's not the end of the world, but it'd sure be nice to fix.

> Whether fsync reports that or not is really
> somewhat besides the point. We don't panic that way when getting IO
> errors during reads either, and they're more likely to be persistent
> than errors during writes (because remapping on storage layer can fix
> issues, but not during reads).
>

That's because reads don't make promises about what's committed and synced.
I think that's quite different.

> We should fix things so that reported errors are treated with crash
> recovery, and for the rest I think there's very fair arguments to be
> made that that's far outside postgres's remit.
>

Certainly for current versions.

I think we need to think about a more robust path in future. But it's
certainly not "stop the world" territory.
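Andres's "reported errors are treated with crash recovery" approach amounts
to never retrying a failed fsync(): after an error the kernel may have
marked the dirty pages clean, so a retry can falsely report success. A
hedged sketch of that policy (hypothetical helper; abort() stands in for
ereport(PANIC, ...), this is not PostgreSQL's actual code path):

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical sketch: treat any fsync() error as unrecoverable.
 * Once fsync() has failed, the state of the dirty pages is unknown
 * (some kernels drop the error and mark pages clean), so retrying
 * the fsync() can succeed without the data being on disk. Forcing
 * crash recovery replays the WAL from the last good checkpoint
 * instead of trusting a retried sync. */
static void
fsync_or_panic(int fd, const char *path)
{
    if (fsync(fd) != 0) {
        fprintf(stderr, "PANIC: could not fsync \"%s\": %s\n",
                path, strerror(errno));
        abort();                /* stand-in for ereport(PANIC, ...) */
    }
}
```

On a healthy file descriptor this is a no-op; the only behavioural change
is refusing to paper over a reported error.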

The docs need an update to indicate that we explicitly disclaim
responsibility for I/O errors on async writes, and that the kernel and I/O
stack must be configured never to give up on buffered writes. If the stack
does give up, that's no longer our problem.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
