Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Andres Freund <andres(at)anarazel(dot)de>
To: Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-02 23:23:24
Message-ID: 20180402232324.hdt2345l6srm2upw@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2018-04-03 01:05:44 +0200, Anthony Iliopoulos wrote:
> Would using sync_file_range() be helpful? Potential errors would only
> apply to pages that cover the requested file ranges. There are a few
> caveats though:

To quote sync_file_range(2):
Warning
This system call is extremely dangerous and should not be used in portable programs. None of these operations writes out the
file's metadata. Therefore, unless the application is strictly performing overwrites of already-instantiated disk blocks, there
are no guarantees that the data will be available after a crash. There is no user interface to know if a write is purely an over‐
write. On filesystems using copy-on-write semantics (e.g., btrfs) an overwrite of existing allocated blocks is impossible. When
writing into preallocated space, many filesystems also require calls into the block allocator, which this system call does not
sync out to disk. This system call does not flush disk write caches and thus does not provide any data integrity on systems with
volatile disk write caches.

Given the lack of metadata safety that seems entirely a no go. We use
sfr(2), but only to force the kernel's hand around writing back earlier
without throwing away cache contents.

> > > The application will need to deal with that first error irrespective of
> > > subsequent return codes from fsync(). Conceptually every fsync() invocation
> > > demarcates an epoch for which it reports potential errors, so the caller
> > > needs to take responsibility for that particular epoch.
> >
> > We do deal with that error- by realizing that it failed and later
> > *retrying* the fsync(), which is when we get back an "all good!
> > everything with this file descriptor you've opened is sync'd!" and
> > happily expect that to be truth, when, in reality, it's an unfortunate
> > lie and there are still pages associated with that file descriptor which
> > are, in reality, dirty and not sync'd to disk.
>
> It really turns out that this is not how the fsync() semantics work
> though

Except on freebsd and solaris, and perhaps others.

>, exactly because the nature of the errors: even if the kernel
> retained the dirty bits on the failed pages, retrying persisting them
> on the same disk location would simply fail.

That's not guaranteed at all, think NFS.

> Instead the kernel opts for marking those pages clean (since there is
> no other recovery strategy), and reporting once to the caller who can
> potentially deal with it in some manner. It is sadly a bad and
> undocumented convention.

It's broken behaviour justified post facto with the only rational that
was available, which explains why it's so unconvincing. You could just
say "this ship has sailed, and it's to onerous to change because xxx"
and this'd be a done deal. But claiming this is reasonable behaviour is
ridiculous.

Again, you could just continue to error for this fd and still throw away
the data.

> > Consider two independent programs where the first one writes to a file
> > and then calls the second one whose job it is to go out and fsync(),
> > perhaps async from the first, those files. Is the second program
> > supposed to go write to each page that the first one wrote to, in order
> > to ensure that all the dirty bits are set so that the fsync() will
> > actually return if all the dirty pages are written?
>
> I think what you have in mind are the semantics of sync() rather
> than fsync()

If you open the same file with two fds, and write with one, and fsync
with another that's definitely supposed to work. And sync() isn't a
realistic replacement in any sort of way because it's obviously
systemwide, and thus entirely and completely unsuitable. Nor does it
have any sort of better error reporting behaviour, does it?

Greetings,

Andres Freund

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Korotkov 2018-04-02 23:27:33 Re: WIP: Covering + unique indexes.
Previous Message Tatsuo Ishii 2018-04-02 23:09:18 Re: Creating streaming replication standby