Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-02 23:27:35
Message-ID: CAMsr+YEp21emoQG_dEqgUxckGNgZ8oQfab0giK6XLTyFbRY02Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 3 April 2018 at 07:05, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:

> Hi Stephen,
>
> On Mon, Apr 02, 2018 at 04:58:08PM -0400, Stephen Frost wrote:
> >
> > fsync() doesn't reflect the status of given pages, however, it reflects
> > the status of the file descriptor, and as such the file, on which it's
> > called. This notion that fsync() is actually only responsible for the
> > changes which were made to a file since the last fsync() call is pure
> > foolishness. If we were able to pass a list of pages or data ranges to
> > fsync() for it to verify they're on disk then perhaps things would be
> > different, but we can't, all we can do is ask to "please flush all the
> > dirty pages associated with this file descriptor, which represents this
> > file we opened, to disk, and let us know if you were successful."
> >
> > Give us a way to ask "are these specific pages written out to persistant
> > storage?" and we would certainly be happy to use it, and to repeatedly
> > try to flush out pages which weren't synced to disk due to some
> > transient error, and to track those cases and make sure that we don't
> > incorrectly assume that they've been transferred to persistent storage.
>
> Indeed fsync() is simply a rather blunt instrument and a narrow legacy
> interface but further changing its established semantics (no matter how
> unreasonable they may be) is probably not the way to go.
>

They're undocumented and extremely surprising semantics that are arguably a
violation of the POSIX spec for fsync(), or at least a surprising
interpretation of it.

So I don't buy this argument.

It really turns out that this is not how the fsync() semantics work
> though, exactly because the nature of the errors: even if the kernel
> retained the dirty bits on the failed pages, retrying persisting them
> on the same disk location would simply fail.

*might* simply fail.

It depends on why the error ocurred.

I originally identified this behaviour on a multipath system. Multipath
defaults to "throw the writes away, nobody really cares anyway" on error.
It seems to figure a higher level will retry, or the application will
receive the error and retry.

(See no_path_retry in multipath config. AFAICS the default is insanely
dangerous and only suitable for specialist apps that understand the quirks;
you should use no_path_retry=queue).

Instead the kernel opts
> for marking those pages clean (since there is no other recovery
> strategy),

and reporting once to the caller who can potentially deal
> with it in some manner. It is sadly a bad and undocumented convention.
>

It could mark the FD.

It's not just undocumented, it's a slightly creative interpretation of the
POSIX spec for fsync.

> > Consider two independent programs where the first one writes to a file
> > and then calls the second one whose job it is to go out and fsync(),
> > perhaps async from the first, those files. Is the second program
> > supposed to go write to each page that the first one wrote to, in order
> > to ensure that all the dirty bits are set so that the fsync() will
> > actually return if all the dirty pages are written?
>
> I think what you have in mind are the semantics of sync() rather
> than fsync(), but as long as an application needs to ensure data
> are persisted to storage, it needs to retain those data in its heap
> until fsync() is successful instead of discarding them and relying
> on the kernel after write().

This is almost exactly what we tell application authors using PostgreSQL:
the data isn't written until you receive a successful commit confirmation,
so you'd better not forget it.

We provide applications with *clear boundaries* so they can know *exactly*
what was, and was not, written. I guess the argument from the kernel is the
same is true: whatever was written since the last *successful* fsync is
potentially lost and must be redone.

But the fsync behaviour is utterly undocumented and dubiously standard.

> I think we are conflating a few issues here: having the OS kernel being
> responsible for error recovery (so that subsequent fsync() would fix
> the problems) is one. This clearly is a design which most kernels have
> not really adopted for reasons outlined above

[citation needed]

What do other major platforms do here? The post above suggests it's a bit
of a mix of behaviours.

Now, there is the issue of granularity of
> error reporting: userspace could benefit from a fine-grained indication
> of failed pages (or file ranges).

Yep. I looked at AIO in the hopes that, if we used AIO, we'd be able to map
a sync failure back to an individual AIO write.

But it seems AIO just adds more problems and fixes none. Flush behaviour
with AIO from what I can tell is inconsistent version to version and
generally unhelpful. The kernel should really report such sync failures
back to the app on its AIO write mapping, but it seems nothing of the sort
happens.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2018-04-02 23:27:55 Re: BUG #14999: pg_rewind corrupts control file global/pg_control
Previous Message Alexander Korotkov 2018-04-02 23:27:33 Re: WIP: Covering + unique indexes.