Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Greg Stark <stark(at)mit(dot)edu>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Anthony Iliopoulos <ailiop(at)altatus(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-08 21:23:21
Message-ID: CAM-w4HMkn6vgxozFGCMK-X_P+Gwaxy1HK3bviLpFE+h0BZ3-4A@mail.gmail.com
Lists: pgsql-hackers

On 8 April 2018 at 04:27, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
> On 8 April 2018 at 10:16, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
> wrote:
>
> If the kernel does writeback in the middle, how on earth is it supposed to
> know we expect to reopen the file and check back later?
>
> Should it just remember "this file had an error" forever, and tell every
> caller? In that case how could we recover? We'd need some new API to say
> "yeah, ok already, I'm redoing all my work since the last good fsync() so
> you can clear the error flag now". Otherwise it'd keep reporting an error
> after we did redo to recover, too.

There is no spoon^H^H^H^H^Herror flag. We don't need fsync to keep
track of any errors. We just need fsync to accurately report whether
all the buffers in the file have been written out. When you call fsync
again the kernel needs to initiate i/o on all the dirty buffers and
block until they complete successfully. If they complete successfully
then nobody cares whether an earlier attempt to write them out failed
at some point in the past.
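
As a minimal sketch of that contract seen from the caller's side: the
helper below (a hypothetical name, write_and_sync, not anything from
PostgreSQL) treats data as durable only once os.fsync() has returned
without raising. It assumes a POSIX-like system and is an illustration
of the expectation, not a claim about what any kernel actually does.

```python
import os


def write_and_sync(path: str, data: bytes) -> None:
    """Write data to path and return only once fsync() reports success.

    Hypothetical helper for illustration. Any OSError from write() or
    fsync() propagates to the caller, who must then assume the data
    may not have reached stable storage.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write() may write fewer bytes than requested.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to flush this file's dirty buffers to storage.
        os.fsync(fd)
    finally:
        os.close(fd)
```

The whole argument rests on what a successful return from os.fsync()
is allowed to mean here: that every buffer of the file is on disk.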

The problem is not that errors aren't being tracked correctly. The
problem is that dirty buffers are being marked clean when they haven't
been written out. They consider dirty filesystem buffers that a
hardware failure prevents from being written to be "a memory leak".

As long as any error means the kernel has discarded writes then
there's no real hope of any reliable operation through that interface.
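
To make the difference concrete, here is a toy model, not real kernel
behaviour. ToyPageCache, drop_on_error and retry_after_failure are all
made-up names for this sketch. It simulates one failed fsync followed
by a retry after the simulated hardware recovers, under the two
policies discussed above.

```python
from typing import Tuple


class ToyPageCache:
    """Toy model of a file's dirty buffers; not real kernel behaviour.

    drop_on_error=True models the behaviour objected to above: a failed
    writeback marks the buffer clean. drop_on_error=False models the
    behaviour asked for: the buffer stays dirty until it is written.
    """

    def __init__(self, drop_on_error: bool) -> None:
        self.drop_on_error = drop_on_error
        self.dirty = {}        # block number -> data not yet on disk
        self.disk = {}         # block number -> data on the simulated disk
        self.failing = False   # simulated hardware failure switch

    def write(self, block: int, data: bytes) -> None:
        self.dirty[block] = data

    def fsync(self) -> bool:
        """Try to write every dirty block; return True if all succeeded."""
        ok = True
        for block in list(self.dirty):
            if self.failing:
                ok = False
                if self.drop_on_error:
                    del self.dirty[block]  # marked clean, data is lost
            else:
                self.disk[block] = self.dirty.pop(block)
        return ok


def retry_after_failure(drop_on_error: bool) -> Tuple[bool, bool]:
    """Fail one fsync, recover, retry; return (retry_ok, data_on_disk)."""
    cache = ToyPageCache(drop_on_error)
    cache.write(0, b"commit record")
    cache.failing = True
    cache.fsync()              # first attempt fails
    cache.failing = False
    retry_ok = cache.fsync()   # second attempt after "hardware" recovers
    return retry_ok, cache.disk.get(0) == b"commit record"
```

In this model retry_after_failure(True) returns (True, False): the
retry reports success although the data never reached the simulated
disk. retry_after_failure(False) returns (True, True).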

Going to DIRECTIO is basically recognizing this: that the kernel
filesystem buffer provides no reliable interface, so we need to
reimplement it ourselves in user space.

It's rather disheartening. Aside from having to do all that work we
have the added barrier that we don't have as much information about
the hardware as the kernel has. We don't know where RAID stripes begin
and end, how big the memory controller buffers are, how to tell when
they're full or empty, or how to flush them, and so on. We also don't
know what else is going on on the machine.

--
greg
