Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Catalin Iacob <iacobcatalin(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Craig Ringer <craig(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-03-29 16:20:00
Message-ID: CAHg_5gqXwiun=inh=2QomnvqvRYb_jrYcnfiEekZ=hQKbY2XBA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 29, 2018 at 2:07 PM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> I found your discussion with kernel hacker Jeff Layton at
> https://lwn.net/Articles/718734/ in which he said: "The stackoverflow
> writeup seems to want a scheme where pages stay dirty after a
> writeback failure so that we can try to fsync them again. Note that
> that has never been the case in Linux after hard writeback failures,
> AFAIK, so programs should definitely not assume that behavior."

A bit further down in the same comments, someone asks this about PG: "So,
what are the options at this point? The assumption was that we can
repeat the fsync (which as you point out is not the case), or shut
down the database and perform recovery from WAL". In his reply, Jeff
Layton seems to agree that PANIC is the appropriate response:
"Replaying the WAL synchronously sounds like the simplest approach
when you get an error on fsync. These are uncommon occurrences for the
most part, so having to fall back to slow, synchronous error recovery
modes when this occurs is probably what you want to do."
Right after that, he confirms that the errseq_t patches are only about
reliably detecting this, nothing more:
"The main thing I working on is to better guarantee is that you
actually get an error when this occurs rather than silently corrupting
your data. The circumstances where that can occur require some
corner-cases, but I think we need to make sure that it doesn't occur."
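
To make the "don't retry, fall back to recovery" point concrete, here is a
minimal sketch in Python rather than PostgreSQL's C. It is not PostgreSQL
code: checkpoint_write, FsyncFailed and FlakyFsync are names made up for
illustration, and FlakyFsync only imitates a kernel that reports a
writeback error once and then reports success on the next call.

```python
import errno
import os
import tempfile


class FsyncFailed(Exception):
    """fsync reported an error; data written since the last good sync is suspect."""


def write_all(fd, data):
    # os.write may write fewer bytes than requested, so loop until done.
    view = memoryview(data)
    while len(view) > 0:
        written = os.write(fd, view)
        view = view[written:]


def checkpoint_write(fd, data, fsync=os.fsync):
    """Write data and sync it. Any fsync error is escalated, never retried.

    Retrying is the trap discussed above: the failed pages may already be
    marked clean, so a second fsync can return success without the data
    having reached disk.
    """
    write_all(fd, data)
    try:
        fsync(fd)
    except OSError as exc:
        raise FsyncFailed("fsync failed, recovery needed") from exc


class FlakyFsync:
    """Hypothetical stand-in: fails with EIO once, then claims success."""

    def __init__(self):
        self.calls = 0

    def __call__(self, fd):
        self.calls += 1
        if self.calls == 1:
            raise OSError(errno.EIO, os.strerror(errno.EIO))
        # Later calls return normally, as if nothing had been lost.


def demo():
    flaky = FlakyFsync()
    with tempfile.TemporaryFile() as f:
        try:
            checkpoint_write(f.fileno(), b"page data", fsync=flaky)
        except FsyncFailed:
            # This is where a database would PANIC and replay the WAL.
            return flaky.calls, "recover from WAL"
    return flaky.calls, "ok"
```

The important detail is that fsync is called exactly once per attempt. A
retry loop around it would see the second call succeed and carry on as if
the data were safe.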

Jeff's comments in the pull request that merged errseq_t are worth
reading as well:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=088737f44bbf6378745f5b57b035e57ee3dc4750

> The article above that says the same thing a couple of different ways,
> ie that writeback failure leaves you with pages that are neither
> written to disk successfully nor marked dirty.
>
> If I'm reading various articles correctly, the situation was even
> worse before his errseq_t stuff landed. That fixed cases of
> completely unreported writeback failures due to sharing of PG_error
> for both writeback and read errors with certain filesystems, but it
> doesn't address the clean pages problem.

Indeed, that's exactly how I read it as well (opinion formed
independently before reading your sentence above). The errseq_t
patches landed in v4.13 by the way, so very recently.

> Yeah, I see why you want to PANIC.

Indeed. Even doing that leaves question marks over all the kernel
versions before v4.13, which at this point is pretty much everything
out there, since they don't even detect this reliably. This is messy.
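
As a rough illustration of the version cutoff, a sketch like the following
could flag kernels older than v4.13. The helper names are made up, and the
check is only a hint: it assumes the release string starts with
"major.minor", and it cannot see whether a distribution kernel carries
backported patches.

```python
import platform
import re

# errseq_t landed in v4.13, as noted above.
ERRSEQ_MIN_VERSION = (4, 13)


def parse_kernel_release(release):
    """Return (major, minor) from a string like '4.9.0-6-amd64', or None."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def has_errseq(release):
    """True if the release string indicates v4.13 or later."""
    version = parse_kernel_release(release)
    return version is not None and version >= ERRSEQ_MIN_VERSION


if __name__ == "__main__":
    # platform.release() is only meaningful for this check on Linux.
    release = platform.release()
    print(release, "errseq_t expected:", has_errseq(release))
```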
