From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Christophe Pettus <xof(at)thebuild(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, Robert Haas <robertmhaas(at)gmail(dot)com>, Anthony Iliopoulos <ailiop(at)altatus(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-09 03:15:01
Message-ID: CAMsr+YFjFrv2SH1=W-Z2OL3-87bTN5NBwQbnOkyUdPAFjQ78nA@mail.gmail.com
Lists: pgsql-hackers

On 9 April 2018 at 10:06, Andres Freund <andres(at)anarazel(dot)de> wrote:

> > And in many failure modes there's no reason to expect any data loss at
> > all, like:
> >
> > * Local disk fills up (seems to be safe already due to space reservation
> >   at write() time)
>
> That definitely should be treated separately.
>

It is, because all the FSes I looked at reserve space before returning from
write(), even if they do delayed allocation. So they won't fail with ENOSPC
at fsync() time or silently due to lost errors on background writeback.
Otherwise we'd be hearing a LOT more noise about this.

> > * Thin-provisioned storage backing local volume iSCSI or paravirt block
> > device fills up
> > * NFS volume fills up
>
> Those should be the same as the above.
>

Unfortunately, they aren't.

AFAICS NFS doesn't reserve space with the other end before returning from
write(), even if mounted with the sync option. So we can get ENOSPC lazily
when the buffer writeback fails due to a full backing file system. This
then travels the same path as EIO: we fsync(), ERROR, retry, appear to
succeed, and carry on with life, losing the data. Or we never hear about the
error in the first place.

(There's a proposed extension that'd allow this, see
https://tools.ietf.org/html/draft-iyer-nfsv4-space-reservation-ops-02#page-5,
but I see no mention of it in fs/nfs. All the reserve_space /
xdr_reserve_space stuff seems to be related to space in protocol messages
at a quick read.)
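To make the failure mode concrete, here's a minimal C sketch (not Pg code,
and the file path is whatever you pass in) of the pattern that goes wrong:
on a local filesystem a full disk normally shows up as ENOSPC from write()
itself, but on NFS or thin-provisioned storage the error may only surface at
fsync() time, and a retried fsync() after a failure can report success even
though the dirty pages were thrown away.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "testfile";
    char        buf[8192];
    int         fd;

    memset(buf, 'x', sizeof(buf));

    fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* On local ext4/xfs/btrfs, a full filesystem fails here with ENOSPC. */
    if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
    {
        perror("write");
        return 1;
    }

    /*
     * On NFS or thin-provisioned storage the failure may only be reported
     * here, when the dirty pages are flushed to the backing store.
     */
    if (fsync(fd) != 0)
    {
        perror("fsync");                /* e.g. EIO or ENOSPC */

        /*
         * The dangerous part: by now the kernel has typically marked the
         * dirty pages clean (or dropped them), so a retried fsync() can
         * return 0 even though nothing more was written. Treating that 0
         * as "the data is durable now" is exactly the mistake at issue.
         */
        if (fsync(fd) == 0)
            fprintf(stderr, "retry \"succeeded\", but the data may be gone\n");

        return 1;
    }

    close(fd);
    return 0;
}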

Thin provisioned storage could vary a fair bit depending on the
implementation. But the specific failure case I saw, prompting this thread,
was on a volume using the stack:

xfs -> lvm2 -> multipath -> ??? -> SAN

(The HBA/iSCSI/whatever layer doesn't appear to have been recorded, but IIRC
it was iSCSI. I'm checking.)

The SAN ran out of space. Due to use of thin provisioning, Linux *thought*
there was plenty of space on the volume; LVM thought it had plenty of
physical extents free and unallocated, XFS thought there was tons of free
space, etc. The space exhaustion manifested as I/O errors on flushes of
writeback buffers.

The logs were like this:

kernel: sd 2:0:0:1: [sdd] Unhandled sense code
kernel: sd 2:0:0:1: [sdd]
kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
kernel: sd 2:0:0:1: [sdd]
kernel: Sense Key : Data Protect [current]
kernel: sd 2:0:0:1: [sdd]
kernel: Add. Sense: Space allocation failed write protect
kernel: sd 2:0:0:1: [sdd] CDB:
kernel: Write(16): **HEX-DATA-CUT-OUT**
kernel: Buffer I/O error on device dm-0, logical block 3098338786
kernel: lost page write due to I/O error on dm-0
kernel: Buffer I/O error on device dm-0, logical block 3098338787

The immediate cause was that Linux's multipath driver didn't seem to
recognise the sense code as retryable, so it gave up and reported it to the
next layer up (LVM). LVM and XFS both seem to think that the lower layer is
responsible for retries, so they toss the write away, and tell any
interested writers if they feel like it, per discussion upthread.

In this case Pg did get the news and reported fsync() errors on
checkpoints, but it only reported an error once per relfilenode. Once it
ran out of failed relfilenodes to cause the checkpoint to ERROR, it
"completed" a "successful" checkpoint and kept on running until the
resulting corruption started to manifest itself and it segfaulted some
time later. As we've now learned, there's no guarantee we'd even get the
news about the I/O errors at all.

WAL was on a separate volume that didn't run out of room immediately, so we
didn't PANIC on a WAL write failure, which would have prevented the issue.

In this case if Pg had PANIC'd (and been able to guarantee to get the news
of write failures reliably), there'd have been no corruption and no data
loss despite the underlying storage issue.

If, prior to seeing this, you'd asked me "will my PostgreSQL database be
corrupted if my thin-provisioned volume runs out of space" I'd have said
"Surely not. PostgreSQL won't be corrupted by running out of disk space, it
orders writes carefully and forces flushes so that it will recover
gracefully from write failures."

Except not. I was very surprised.

BTW, it also turns out that the *default* for multipath is to give up on
errors anyway; see the queue_if_no_path and no_path_retry options. (Hint:
run PostgreSQL with no_path_retry=queue.) That's a sane default if you use
O_DIRECT|O_SYNC, and otherwise pretty much a data-eating setup.

I regularly see rather a lot of multipath systems, iSCSI systems, SAN-backed
systems, etc. I think we need to be pretty clear that we expect them to retry
indefinitely, and that if they report an I/O error we cannot reliably handle
it. We need to patch Pg to PANIC on any fsync() failure and document that Pg
won't notice some storage failure modes that might otherwise be considered
nonfatal or transient, so very specific storage configuration and testing are
required. (Not that anyone will do it.) Also warn against running on NFS even
with "hard,sync,nointr".
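For contrast, here's a hedged sketch of the policy being argued for; it is
not actual Pg source, and durable_fsync_or_die() is an invented name. The
point is simply: never retry a failed fsync() and believe the second result;
give up and let WAL replay redo the work from a known-good point.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void
durable_fsync_or_die(int fd, const char *what)
{
    if (fsync(fd) != 0)
    {
        /*
         * In PostgreSQL terms this would be a PANIC, forcing crash recovery
         * to replay WAL from a known-good point; abort() stands in for that
         * here. The one thing we must not do is log an ERROR, retry the
         * fsync(), and carry on when the retry reports success.
         */
        perror(what);
        abort();
    }
}

int
main(void)
{
    /* Trivial usage: flush a scratch file, dying on any fsync error. */
    FILE *f = fopen("scratch.tmp", "w");

    if (f == NULL)
        return 1;
    fputs("hello\n", f);
    fflush(f);
    durable_fsync_or_die(fileno(f), "fsync scratch.tmp");
    fclose(f);
    return 0;
}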

It'd be interesting to have a tool that tested error handling, allowing
people to do iSCSI plug-pull tests, that sort of thing. But as far as I can
tell nobody ever tests their storage stack anyway, so I don't plan on
writing something that'll never get used.

> > I think we need to think about a more robust path in future. But it's
> > certainly not "stop the world" territory.
>
> I think you're underestimating the complexity of doing that by at least
> two orders of magnitude.

Oh, it's just a minor total rewrite of half Pg, no big deal ;)

I'm sure that no matter how big I think it is, I'm still underestimating it.

The most workable option IMO would be some sort of fnotify/dnotify/whatever
interface that reports all I/O errors on a volume: some kind of
error-reporting handle we could keep open at the volume level and check for
each volume/tablespace after we fsync() everything, to see whether it all
really worked. If we PANIC when that gives us a bad answer, and PANIC on
fsync errors, we guard against the great majority of these sorts of
should-be-transient-if-the-kernel-didn't-give-up-and-throw-away-our-data
errors.

Even then, good luck getting those events from an NFS volume in which the
backing volume experiences an issue.

And it's kind of moot because AFAICS no such interface exists.
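Purely to illustrate the shape of the thing being wished for, here's a
hypothetical sketch. Every name in it is invented and the stubs exist only so
it compiles; as noted above, no such kernel interface actually exists.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* --- imagined interface, stubbed out so the sketch compiles --- */
typedef struct
{
    const char *mountpoint;
} vol_err_handle;

static vol_err_handle *
vol_err_open(const char *mountpoint)        /* hypothetical */
{
    static vol_err_handle h;

    h.mountpoint = mountpoint;
    return &h;
}

static bool
vol_err_check_and_clear(vol_err_handle *h)  /* hypothetical */
{
    (void) h;
    return false;   /* a real kernel would report lost writeback errors here */
}

/* --- how a checkpoint might use it --- */
int
main(void)
{
    vol_err_handle *h = vol_err_open("/pg/tablespace1");    /* made-up path */

    /* ... fsync() every relation file on this volume here ... */

    if (vol_err_check_and_clear(h))
    {
        /* Stand-in for PANIC: don't trust the "successful" fsyncs. */
        fprintf(stderr, "writeback errors on %s since last checkpoint\n",
                h->mountpoint);
        abort();
    }
    return 0;
}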

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
