Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To:	Bruce Momjian <bruce(at)momjian(dot)us>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Anthony Iliopoulos <ailiop(at)altatus(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date:	2018-04-04 01:54:50
Message-ID:	CAEepm=2OcWtqQFAZQ26Rdhu+gkyTUYBfLdUjF3QeaMNsuMzd6w@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Apr 4, 2018 at 12:56 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> There has been a lot of focus in this thread on the workflow:
>
> write() -> blocks remain in kernel memory -> fsync() -> panic?
>
> But what happens in this workflow:
>
> write() -> kernel syncs blocks to storage -> fsync()
>
> Is fsync() going to see a "kernel syncs blocks to storage" failure?
>
> There was already discussion that if the fsync() causes the "syncs
> blocks to storage", fsync() will only report the failure once, but will
> it see any failure in the second workflow? There is indication that a
> failed write to storage reports back an error once and clears the dirty
> flag, but do we know it keeps things around long enough to report an
> error to a future fsync()?
>
> You would think it does, but I have to ask since our fsync() assumptions
> have been wrong for so long.

I believe there were some problems of that nature (with various
twists, based on other concurrent activity and possibly different
fds), and those problems were fixed by the errseq_t system developed
by Jeff Layton in Linux 4.13. Call that "bug #1".

The second issue is that the pages are marked clean after the error
is reported, so further attempts to fsync() the data (in our case for
a new attempt to checkpoint) will be futile but appear successful.
Call that "bug #2", with the proviso that some people apparently think
it's reasonable behaviour and not a bug. At least there is a
plausible workaround for that: namely the nuclear option proposed by
Craig.
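
To make the consequence of "bug #2" concrete, here is a minimal sketch of
a caller that never retries a failed fsync(). It assumes the "nuclear
option" means treating any fsync() failure as fatal and relying on crash
recovery, rather than calling fsync() again. The names sync_result,
sync_once and sync_or_abort are hypothetical and are not PostgreSQL code.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical result type for illustration; not part of PostgreSQL. */
typedef enum { SYNC_OK, SYNC_FATAL } sync_result;

/*
 * Call fsync() exactly once and classify the outcome.
 *
 * There is deliberately no "retry" outcome: after a failed fsync() the
 * kernel may already have marked the affected pages clean, so a second
 * fsync() could return 0 even though the data never reached storage.
 */
static sync_result
sync_once(int fd)
{
    if (fsync(fd) == 0)
        return SYNC_OK;
    return SYNC_FATAL;
}

/*
 * Assumed policy: on failure, report the error and abort the process so
 * that the data is rebuilt from the log on restart instead of trusting a
 * later, possibly misleading, successful fsync().
 */
static void
sync_or_abort(int fd, const char *label)
{
    if (sync_once(fd) != SYNC_OK)
    {
        fprintf(stderr, "fsync failed on %s: %s; aborting\n",
                label, strerror(errno));
        abort();
    }
}
```

A checkpoint-like caller would invoke sync_or_abort(fd, path) once per
file and never loop on failure. This sketch treats every errno the same
way; a real implementation would have to decide how to handle cases such
as EINTR.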

--
Thomas Munro
http://www.enterprisedb.com
