Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Catalin Iacob <iacobcatalin(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-02 15:03:42
Message-ID: CAMsr+YHtosoQKzHh-nAmyG75cAPTzTtwyk871d+1O-sNQRdeyg@mail.gmail.com
Lists: pgsql-hackers

On 2 April 2018 at 02:24, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
wrote:

>
> Maybe my drive-by assessment of those kernel routines is wrong and
> someone will correct me, but I'm starting to think you might be better
> to assume the worst on all systems. Perhaps a GUC that defaults to
> panicking, so that users on those rare OSes could turn that off? Even
> then I'm not sure if the failure mode will be that great anyway or if
> it's worth having two behaviours. Thoughts?
>

I see little benefit to not just PANICing unconditionally on EIO, really.
It shouldn't happen, and if it does, we want to be pretty conservative and
adopt a data-protective approach.

I'm rather more worried about doing it on ENOSPC, which looks like it might
be necessary from what I recall finding in my test case + kernel code reading.
I really don't want to respond to a possibly-transient ENOSPC by PANICing
the whole server unnecessarily.

BTW, the support team at 2ndQ is presently working on two separate issues
where ENOSPC resulted in DB corruption, though neither of them involves logs
of lost page writes. I'm planning on taking some time tomorrow to write a
torture tester for Pg's ENOSPC handling and to verify ENOSPC handling in
the test case I linked to in my original StackOverflow post.
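
To illustrate the kind of probe such a tester might be built around, here is
a minimal sketch in C. It is not the tester itself and not PostgreSQL code;
checked_write(), FailStep and WriteResult are hypothetical names made up for
this example. It only records at which step (open, write, fsync or close) an
error first surfaced and with which errno, since an out-of-space condition is
not necessarily reported by write() itself.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical: which step of the write path reported the first error. */
typedef enum {
    STEP_NONE,      /* everything succeeded */
    STEP_OPEN,
    STEP_WRITE,
    STEP_FSYNC,
    STEP_CLOSE
} FailStep;

/* Hypothetical result record: failing step plus the errno seen there. */
typedef struct {
    FailStep step;
    int      err;
} WriteResult;

/*
 * Write len bytes of buf to path, fsync, and close, checking every return
 * value. Returns the first failing step and its errno, or STEP_NONE / 0.
 */
WriteResult
checked_write(const char *path, const void *buf, size_t len)
{
    WriteResult r = { STEP_NONE, 0 };
    const char *p = buf;
    int         fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd < 0) {
        r.step = STEP_OPEN;
        r.err = errno;
        return r;
    }

    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0 && errno == EINTR)
            continue;               /* interrupted before writing; retry */
        if (n <= 0) {
            /* n == 0 means no progress; errno is not set in that case */
            r.step = STEP_WRITE;
            r.err = (n < 0) ? errno : 0;
            close(fd);              /* errno already saved above */
            return r;
        }
        p += n;
        len -= (size_t) n;
    }

    if (fsync(fd) != 0) {
        r.step = STEP_FSYNC;
        r.err = errno;
        close(fd);
        return r;
    }

    if (close(fd) != 0) {
        r.step = STEP_CLOSE;
        r.err = errno;
    }
    return r;
}
```

On Linux, pointing this at /dev/full should make the write step fail with
ENOSPC, which is a cheap way to exercise that path. Provoking ENOSPC at
fsync() time needs a real (or loopback) file system that is nearly full.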

If this is just an EIO issue then I see no point doing anything other than
PANICing unconditionally.

If it's a concern for ENOSPC too, we should try harder to fail more nicely
whenever we possibly can.
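
As a rough sketch of the two alternatives above, in C: this is not
PostgreSQL code. FsyncAction, fsync_failure_action() and the
enospc_is_fatal switch are hypothetical names for illustration only.
The function maps the result of an fsync() call to an action: any
failure leads to a panic, except that ENOSPC may be handled more
gently when the caller has decided it is safe to do so.

```c
#include <errno.h>

/* Hypothetical action codes; not PostgreSQL's actual error levels. */
typedef enum {
    FSYNC_OK,           /* fsync() succeeded, nothing to do */
    FSYNC_PANIC,        /* stop the server and recover from WAL */
    FSYNC_FAIL_SOFT     /* report an error but keep the server running */
} FsyncAction;

/*
 * Decide what to do after fsync() returned rc with errno saved in
 * saved_errno. enospc_is_fatal is an assumption of this sketch: set it
 * if a failed fsync() with ENOSPC may have lost dirty pages, in which
 * case ENOSPC has to be treated like EIO.
 */
FsyncAction
fsync_failure_action(int rc, int saved_errno, int enospc_is_fatal)
{
    if (rc == 0)
        return FSYNC_OK;

    if (saved_errno == ENOSPC && !enospc_is_fatal)
        return FSYNC_FAIL_SOFT;

    /* EIO and anything else: be conservative and protect the data. */
    return FSYNC_PANIC;
}
```

A caller would save errno immediately after a failed fsync() and pass it
in. Whether enospc_is_fatal can ever safely be false is exactly the open
question here.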

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
