From: | Bruce Momjian <bruce(at)momjian(dot)us> |
---|---|
To: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Anthony Iliopoulos <ailiop(at)altatus(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS |
Date: | 2018-04-04 02:05:19 |
Message-ID: | 20180404020519.GB25202@momjian.us |
Lists: | pgsql-hackers |
On Wed, Apr 4, 2018 at 01:54:50PM +1200, Thomas Munro wrote:
> On Wed, Apr 4, 2018 at 12:56 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> > There has been a lot of focus in this thread on the workflow:
> >
> > write() -> blocks remain in kernel memory -> fsync() -> panic?
> >
> > But what happens in this workflow:
> >
> > write() -> kernel syncs blocks to storage -> fsync()
> >
> > Is fsync() going to see a "kernel syncs blocks to storage" failure?
> >
> > There was already discussion that if the fsync() causes the "syncs
> > blocks to storage", fsync() will only report the failure once, but will
> > it see any failure in the second workflow? There is indication that a
> > failed write to storage reports back an error once and clears the dirty
> > flag, but do we know it keeps things around long enough to report an
> > error to a future fsync()?
> >
> > You would think it does, but I have to ask since our fsync() assumptions
> > have been wrong for so long.
>
> I believe there were some problems of that nature (with various
> twists, based on other concurrent activity and possibly different
> fds), and those problems were fixed by the errseq_t system developed
> by Jeff Layton in Linux 4.13. Call that "bug #1".

So all our non-cutting-edge Linux systems are vulnerable and there is no
workaround Postgres can implement? Wow.

> The second issue is that the pages are marked clean after the error
> is reported, so further attempts to fsync() the data (in our case for
> a new attempt to checkpoint) will be futile but appear successful.
> Call that "bug #2", with the proviso that some people apparently think
> it's reasonable behaviour and not a bug. At least there is a
> plausible workaround for that: namely the nuclear option proposed by
> Craig.

Yes, that one I understood.

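To make the "bug #2" workaround concrete, here is a minimal sketch of the caller-side rule. It assumes the nuclear option means treating any fsync() failure as fatal and never retrying, which this message does not spell out. The names write_and_sync, FsyncFailed and failing_fsync are hypothetical, and the injectable fsync argument exists only so the failure path can be exercised without faulty storage. This is Python for brevity, not how Postgres itself would do it.

```python
import errno
import os
import tempfile


class FsyncFailed(Exception):
    """Raised when fsync() reports an error; the caller must not retry."""


def write_and_sync(path, data, fsync=os.fsync):
    # `fsync` is a hypothetical test hook; real callers use os.fsync.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        while view:
            # os.write() may write fewer bytes than requested.
            written = os.write(fd, view)
            view = view[written:]
        try:
            fsync(fd)
        except OSError as exc:
            # Do not call fsync() again here. If the kernel has already
            # marked the failed pages clean ("bug #2"), a second call can
            # report success even though the data never reached storage.
            raise FsyncFailed(f"fsync failed for {path}") from exc
    finally:
        os.close(fd)


def failing_fsync(fd):
    # Stand-in for a storage failure, for illustration only.
    raise OSError(errno.EIO, os.strerror(errno.EIO))
```

A caller would let FsyncFailed propagate up to whatever performs crash recovery, rather than catching it and trying the checkpoint again. Note that this does nothing for "bug #1": if the kernel never reports the writeback error to this file descriptor, there is nothing here to catch.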
--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +