Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Anthony Iliopoulos <ailiop(at)altatus(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-04 06:00:21
Message-ID: CAMsr+YGiVR_BWSPjdra+0DbHfyfYB7=DxBmg2f59wf_uvM0zxg@mail.gmail.com
Lists: pgsql-hackers

On 4 April 2018 at 13:29, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:

> On Wed, Apr 4, 2018 at 2:44 PM, Thomas Munro
> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> > On Wed, Apr 4, 2018 at 2:14 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> >> Uh, are you sure it fixes our use-case? From the email description it
> >> sounded like it only reported fsync errors for every open file
> >> descriptor at the time of the failure, but the checkpoint process might
> >> open the file _after_ the failure and try to fsync a write that happened
> >> _before_ the failure.
> >
> > I'm not sure of anything. I can see that it's designed to report
> > errors since the last fsync() of the *file* (presumably via any fd),
> > which sounds like the desired behaviour:
> >
> > [..]
>
> Scratch that. Whenever you open a file descriptor you can't see any
> preceding errors at all, because:
>
> /* Ensure that we skip any errors that predate opening of the file */
> f->f_wb_err = filemap_sample_wb_err(f->f_mapping);
>
> https://github.com/torvalds/linux/blob/master/fs/open.c#L752
>
> Our whole design is based on being able to open, close and reopen
> files at will from any process, and in particular to fsync() from a
> different process that didn't inherit the fd but instead opened it
> later. But it looks like that might be able to eat errors that
> occurred during asynchronous writeback (when there was nobody to
> report them to), before you opened the file?
>

Holy hell. So even PANICing when fsync() fails isn't sufficient, because the
kernel will deliberately hide from us any writeback errors that predate our
open() of the file, so our fsync() never sees them?
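
To make the pattern in question concrete, here is a minimal sketch in C of a write through one file descriptor followed by an fsync() through a different, later-opened one. The function names backend_write() and checkpointer_fsync() are hypothetical stand-ins, not PostgreSQL functions, and for simplicity both run in one process here rather than in separate ones. On a healthy disk both calls simply succeed; the sketch cannot provoke an I/O error, it only shows where one could go unreported.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for a backend: write via a short-lived fd, then close it. */
static int backend_write(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    /* The data only reaches the page cache here; writeback happens later. */
    ssize_t n = write(fd, buf, len);
    int rc = (n == (ssize_t) len) ? 0 : -1;

    close(fd);
    return rc;
}

/* Hypothetical stand-in for the checkpointer: open a fresh fd later and fsync it. */
static int checkpointer_fsync(const char *path)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    /*
     * Going by the fs/open.c snippet quoted above, the error state for this
     * fd is sampled at open() time. A writeback failure that happened after
     * backend_write() returned but before this open() would therefore not be
     * reported by the fsync() below, which could still return 0.
     */
    int rc = fsync(fd);

    close(fd);
    return rc;
}
```

The point is that a zero return from checkpointer_fsync() would only speak for errors seen since its own open(), not for the earlier write.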

I'll see if I can expand my testcase to cover that. I'm presently dockerizing
it to make it easier for others to use, but that turns out to be a major pain
when device-mapper is involved: Docker in privileged mode doesn't seem to play
nicely with it.

Does that mean that the ONLY ways to do reliable I/O are:

- single-process, single-file-descriptor write() then fsync(); on failure,
retry all work since last successful fsync()

or

- direct I/O

?
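
For illustration, the first option could look roughly like the C sketch below: one process, one fd, write() then fsync() on that same fd, and on any failure the whole batch since the last successful fsync() is redone. The helper names write_all() and durable_write() and the max_attempts parameter are made up for this example. Whether rewriting and retrying is actually enough on a given kernel and filesystem is exactly the open question, so treat this as a sketch of the idea rather than a proven recipe.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Write the whole buffer, coping with short writes and EINTR. Returns 0 or -1. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t) n;
    }
    return 0;
}

/*
 * Hypothetical helper: write buf at offset 0 of path and fsync() through the
 * same fd that did the write. If either step fails, redo the entire batch,
 * up to max_attempts times. Returns 0 once an fsync() succeeds, else -1.
 */
static int durable_write(const char *path, const char *buf, size_t len,
                         int max_attempts)
{
    assert(max_attempts > 0);

    int fd = open(path, O_WRONLY | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    int ok = -1;
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        /* Start the batch over from the beginning on every attempt. */
        if (lseek(fd, 0, SEEK_SET) == (off_t) -1)
            break;
        if (write_all(fd, buf, len) == 0 && fsync(fd) == 0) {
            ok = 0;
            break;
        }
    }

    close(fd);
    return ok;
}
```

A caller would do something like durable_write("somefile", data, data_len, 3) and treat a -1 return as "nothing since the last successful fsync() can be assumed durable".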

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
