Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Andres Freund <andres(at)anarazel(dot)de>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Christophe Pettus <xof(at)thebuild(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, Robert Haas <robertmhaas(at)gmail(dot)com>, Anthony Iliopoulos <ailiop(at)altatus(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-09 02:06:12
Message-ID: 20180409020612.4leuhu2e7p7egvxq@alap3.anarazel.de
Lists: pgsql-hackers

On 2018-04-09 10:00:41 +0800, Craig Ringer wrote:
> I suspect we've written off a fair few issues in the past as "it's bad
> hardware" when actually, the hardware fault was the trigger for a Pg/kernel
> interaction bug. And blamed containers for things that weren't really the
> container's fault. But even so, if it were happening tons, we'd hear more
> noise.

Agreed on that, but I think that's FAR more likely to be things like
multixacts, index structure corruption due to logic bugs etc.

> I've already been very surprised there when I learned that PostgreSQL
> completely ignores wholly absent relfilenodes. Specifically, if you
> unlink() a relation's backing relfilenode while Pg is down and that file
> has writes pending in the WAL. We merrily re-create it with uninitialized
> pages and go on our way. As Andres pointed out in an offlist discussion,
> redo isn't a consistency check, and it's not obliged to fail in such cases.
> We can say "well, don't do that then" and define away file losses from FS
> corruption etc as not our problem, the lower levels we expect to take care
> of this have failed.

And it'd be a really bad idea to behave differently.
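
As a toy illustration of the behaviour described in the quoted paragraph
(this is not PostgreSQL's redo code; the function name `redo_write_block`
and the fixed 8192-byte block size are assumptions made up for the sketch):
replaying a block write simply opens the file, creating it if it is gone,
and writes the page at its offset. Nothing checks whether earlier blocks
ever existed.

```python
import os

BLOCK_SIZE = 8192  # assumed page size for this sketch


def redo_write_block(path, block_no, page):
    """Replay one block write; hypothetical helper, not PostgreSQL code.

    If the file is missing it is silently re-created. Blocks below
    block_no that were never written read back as zero-filled pages.
    """
    if len(page) != BLOCK_SIZE:
        raise ValueError("page must be exactly BLOCK_SIZE bytes")
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        offset = block_no * BLOCK_SIZE
        view = memoryview(page)
        while view:
            # pwrite may write fewer bytes than requested
            written = os.pwrite(fd, view, offset)
            view = view[written:]
            offset += written
    finally:
        os.close(fd)
```

Replaying block 2 against a path that no longer exists yields a
three-block file whose first two blocks are all zeros, and no error is
raised at any point.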

> And in many failure modes there's no reason to expect any data loss at all,
> like:
>
> * Local disk fills up (seems to be safe already due to space reservation at
> write() time)

That definitely should be treated separately.

> * Thin-provisioned storage backing local volume iSCSI or paravirt block
> device fills up
> * NFS volume fills up

Those should be the same as the above.
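
To make the distinction in the quoted list concrete, here is a minimal
sketch (not PostgreSQL code; `durable_write` and its return labels are
hypothetical) that reports whether a failure showed up at write() time or
only at fsync() time. On a local filesystem that reserves space at write()
time, running out of space would be expected to surface in the first
stage; on thin-provisioned or NFS storage it may only surface in the
second.

```python
import os


def durable_write(path, data):
    """Write data and fsync it, reporting which step failed (if any).

    Returns (stage, errno_code) where stage is "ok", "write" or "fsync".
    Hypothetical helper for illustration only.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        try:
            view = memoryview(data)
            while view:
                # os.write may write fewer bytes than requested
                written = os.write(fd, view)
                view = view[written:]
        except OSError as exc:
            return ("write", exc.errno)
        try:
            os.fsync(fd)
        except OSError as exc:
            # The data was accepted by write() but never made durable.
            return ("fsync", exc.errno)
        return ("ok", 0)
    finally:
        os.close(fd)
```

A caller would compare the returned errno against errno.ENOSPC or
errno.EIO to decide how to react. The sketch deliberately does not retry
after an fsync-stage failure.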

> I think we need to think about a more robust path in future. But it's
> certainly not "stop the world" territory.

I think you're underestimating the complexity of doing that by at least
two orders of magnitude.

Greetings,

Andres Freund
