Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Christophe Pettus <xof(at)thebuild(dot)com>
Cc: Craig Ringer <craig(at)2ndQuadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, Robert Haas <robertmhaas(at)gmail(dot)com>, Anthony Iliopoulos <ailiop(at)altatus(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-08 22:29:16
Message-ID: 20180408222916.GA9257@momjian.us
Lists: pgsql-hackers

On Sun, Apr 8, 2018 at 09:38:03AM -0700, Christophe Pettus wrote:
>
> > On Apr 8, 2018, at 03:30, Craig Ringer <craig(at)2ndQuadrant(dot)com>
> > wrote:
> >
> > These are way more likely than bit flips or other storage level
> > corruption, and things that we previously expected to detect and
> > fail gracefully for.
>
> This is definitely bad, and it explains a few otherwise-inexplicable
> corruption issues we've seen. (And great work tracking it down!) I
> think it's important not to panic, though; PostgreSQL doesn't have a
> reputation for horrible data integrity. I'm not sure it makes sense
> to do a major rearchitecting of the storage layer (especially with
> pluggable storage coming along) to address this. While the failure
> modes are more common, the solution (a PITR backup) is one that an
> installation should have anyway against media failures.

I think the big problem is that we don't have any way of stopping
Postgres at the time the kernel reports the errors to the kernel log, so
we are then returning potentially incorrect results and committing
transactions that might be wrong or lost. If we could stop Postgres
when such errors happen, at least the administrator could fix the
problem or fail over to a standby.

A crazy idea would be to have a daemon that checks the logs and stops
Postgres when it sees something wrong.
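
To make the daemon idea concrete, here is a rough sketch, not a
tested tool. The regular expressions are illustrative guesses at
kernel I/O error wording, which varies by kernel version and driver;
the function names and the data directory path are hypothetical. It
scans kernel log lines and, on the first match, runs an immediate
shutdown through pg_ctl.

```python
import re
import subprocess
from typing import Callable, Iterable, Optional

# Assumption: example patterns only. Real kernel messages differ
# between kernel versions, filesystems, and block drivers.
IO_ERROR_PATTERNS = [
    re.compile(r"Buffer I/O error on dev"),
    re.compile(r"lost (async|sync) page write"),
    re.compile(r"I/O error, dev \S+, sector \d+"),
]


def first_io_error(lines: Iterable[str]) -> Optional[str]:
    """Return the first line that matches a known pattern, or None."""
    for line in lines:
        if any(p.search(line) for p in IO_ERROR_PATTERNS):
            return line.rstrip("\n")
    return None


def stop_postgres(data_dir: str) -> None:
    """Stop the server without a shutdown checkpoint (immediate mode)."""
    subprocess.run(
        ["pg_ctl", "stop", "-D", data_dir, "-m", "immediate"],
        check=True,
    )


def watch(
    lines: Iterable[str],
    data_dir: str,
    stop: Callable[[str], None] = stop_postgres,
) -> Optional[str]:
    """Consume log lines; stop Postgres on the first suspicious one."""
    hit = first_io_error(lines)
    if hit is not None:
        stop(data_dir)
    return hit
```

The lines could come from something like the stdout of
"journalctl -k -f" or "dmesg --follow". Note the obvious weakness:
by the time the message reaches the log, a checkpoint may already
have completed on top of the failed write, so this narrows the
window rather than closing it.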

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
