Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Gasper Zejn <zejn(at)owca(dot)info>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-21 19:21:39
Message-ID: 31725b87-f007-5849-4370-c25bc30ec2db@owca.info
Lists: pgsql-hackers

Just for the record, I tried the test case on an Ubuntu 17.10 host
with ZFS on Linux 0.6.5.11.

ZFS does not swallow the fsync error, but the system does not handle the
error nicely: the test case program hangs on fsync, the load jumps up
and there's a bunch of z_wr_iss and z_null_int kernel threads belonging
to zfs, eating up the CPU.

Even then I managed to reboot the system, so it's not a complete and
utter mess.

The test case adjustments are here:
https://github.com/zejn/scrapcode/commit/e7612536c346d59a4b69bedfbcafbe8c1079063c

Kind regards,

Gasper

On 29. 03. 2018 07:25, Craig Ringer wrote:
> On 29 March 2018 at 13:06, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>
> On Thu, Mar 29, 2018 at 6:00 PM, Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> > The retries are the source of the problem; the first fsync() can return EIO,
> > and also *clears the error*, causing a 2nd fsync (of the same data) to return
> > success.
>
> What I'm failing to grok here is how that error flag even matters,
> whether it's a single bit or a counter as described in that patch.  If
> write back failed, *the page is still dirty*.  So all future calls to
> fsync() need to try to flush it again, and (presumably) fail
> again (unless it happens to succeed this time around).
>
>
> You'd think so. But it doesn't appear to work that way. You can see
> yourself with the error device-mapper destination mapped over part of
> a volume.
>
> I wrote a test case here.
>
> https://github.com/ringerc/scrapcode/blob/master/testcases/fsync-error-clear.c
>
> I don't pretend the kernel behaviour is sane. And it's possible I've
> made an error in my analysis. But since I've observed this in the
> wild, and seen it in a test case, I strongly suspect that what I've
> described is just what's happening, brain-dead or no.
>
> Presumably the kernel marks the page clean when it dispatches it to
> the I/O subsystem and doesn't dirty it again on I/O error? I haven't
> dug that deep on the kernel side. See the stackoverflow post for
> details on what I found in kernel code analysis.
>
> --
>  Craig Ringer                   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
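
To make the retry problem quoted above concrete, here is a minimal sketch
in C of a caller that calls fsync() exactly once and reports any failure
instead of retrying. It assumes the behaviour described in the thread: a
second fsync() on the same descriptor may return success even though the
data never reached disk. The helper name write_and_sync_once is made up
for illustration; it is not taken from PostgreSQL or from the linked test
case.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only): write the whole buffer, then
 * call fsync() a single time.
 *
 * Returns 0 if both the write and the fsync succeeded, -1 otherwise.
 * On -1 the caller must assume the data is NOT durable and must not
 * simply call fsync() again, because a retry may report success after
 * the kernel has already cleared the error.
 */
static int write_and_sync_once(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing; try again */
            return -1;          /* real write error */
        }
        p += n;
        len -= (size_t) n;
    }

    /* One fsync() only: a failure here is final and is never retried. */
    if (fsync(fd) != 0)
        return -1;

    return 0;
}
```

What the caller does after a -1 (rewrite the data from its own copy,
replay from a log, or give up) is a separate decision that this sketch
deliberately leaves open.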
