Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Anthony Iliopoulos <ailiop(at)altatus(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-08 02:16:07
Message-ID: CAEepm=0nqrNhGPy2-dpWd9OTM4UeztCBWqJ2Mk5hMkB90pdTcw@mail.gmail.com
Lists: pgsql-hackers

So, what can we actually do about this new Linux behaviour?

Idea 1:

* whenever you open a file, either tell the checkpointer so it can
open it too (and wait for it to tell you that it has done so, because
it's not safe to write() until then), or send it a copy of the file
descriptor via IPC (since duplicated file descriptors share the same
f_wb_err)

* if the checkpointer can't take any more file descriptors (how would
that limit even work in the IPC case?), then it somehow needs to tell
you that so that you know that you're responsible for fsyncing that
file yourself, both on close (due to fd cache recycling) and also when
the checkpointer tells you to

Maybe it could be made to work, but sheesh, that seems horrible. Is
there some simpler idea along these lines that could make sure that
fsync() is only ever called on file descriptors that were opened
before all unflushed writes, or file descriptors cloned from such file
descriptors?
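
For concreteness, the "send it a copy of the file descriptor via IPC"
variant might look roughly like the sketch below. It is only an
illustration in Python, not PostgreSQL code: the names send_fd, recv_fd
and demo are made up, both "processes" are simulated inside one process
with a socketpair, and it assumes a Unix-like system with SCM_RIGHTS
support. The received descriptor refers to the same open file
description as the sender's, which is the property Idea 1 relies on.

```python
import array
import os
import socket
import tempfile


def send_fd(sock, fd):
    """Send one open file descriptor over a Unix-domain socket (hypothetical helper)."""
    fds = array.array("i", [fd])
    # At least one byte of ordinary payload must accompany the ancillary data.
    sock.sendmsg([b"F"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])


def recv_fd(sock):
    """Receive one file descriptor sent by send_fd() (hypothetical helper)."""
    fds = array.array("i")
    _msg, ancdata, _flags, _addr = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Ignore any truncated trailing bytes, keep whole ints only.
            fds.frombytes(data[: len(data) - (len(data) % fds.itemsize)])
            return fds[0]
    raise RuntimeError("no file descriptor received")


def demo():
    # The two socket ends stand in for a backend and the checkpointer.
    backend_end, checkpointer_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        with tempfile.NamedTemporaryFile() as f:
            backend_fd = f.fileno()
            # Hand over the descriptor BEFORE the first write().
            send_fd(backend_end, backend_fd)
            checkpointer_fd = recv_fd(checkpointer_end)
            try:
                os.write(backend_fd, b"dirty page")   # 10 bytes
                os.fsync(checkpointer_fd)             # fsync via the received copy
                return {
                    "same_inode": os.fstat(backend_fd).st_ino
                    == os.fstat(checkpointer_fd).st_ino,
                    "distinct_fd": checkpointer_fd != backend_fd,
                    # The file offset is shared, so both descriptors refer
                    # to one open file description.
                    "shared_offset": os.lseek(checkpointer_fd, 0, os.SEEK_CUR),
                }
            finally:
                os.close(checkpointer_fd)
    finally:
        backend_end.close()
        checkpointer_end.close()
```

The sketch says nothing about the hard parts raised above (descriptor
limits in the checkpointer, and waiting for an acknowledgement before
the first write() in the "tell the checkpointer to open it" variant).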

Idea 2:

Give up, complain that this implementation is defective and
unworkable, both on POSIX-compliance grounds and on POLA grounds, and
campaign to get it fixed more fundamentally (actual details left to
the experts, no point in speculating here, but we've seen a few
approaches that work on other operating systems including keeping
buffers dirty and marking the whole filesystem broken/read-only).

Idea 3:

Give up on buffered IO and develop an O_SYNC | O_DIRECT based system ASAP.

Any other ideas?

For a while I considered suggesting an idea which I now think doesn't
work. I thought we could try asking for a new fcntl interface that
spits out the wb_err counter. Call it an opaque error token or something.
Then we could store it in our fsync queue and safely close the file.
Check again before fsync()ing, and if we ever see a different value,
PANIC because it means a writeback error happened while we weren't
looking. Sadly I think it doesn't work because AIUI inodes are not
pinned in kernel memory when no one has the file open and there are no
dirty buffers, so I think the counters could go away and be reset.
Perhaps you could keep inodes pinned by keeping the associated buffers
dirty after an error (like FreeBSD), but if you did that you'd have
solved the problem already and wouldn't really need the wb_err system
at all. Is there some other idea along these lines that could work?
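
To make the failure mode concrete, here is a toy simulation in Python.
Everything in it is an assumption for illustration: no such fcntl
exists, ToyKernel is not a model of the real errseq_t machinery, and
the names error_token, writeback_fails, evict and check_before_fsync
are invented. It only shows why a saved token is useless if the
counter can be dropped and recreated while the file is closed.

```python
class ToyKernel:
    """Toy model of per-inode writeback error counters. Not a real kernel API."""

    def __init__(self):
        # inode number -> error counter; an entry exists only while the
        # inode is cached in (simulated) kernel memory.
        self.cached = {}

    def error_token(self, inode):
        # Stand-in for the hypothetical fcntl: report the current counter.
        # An inode that is not cached is loaded fresh with a counter of 0.
        return self.cached.setdefault(inode, 0)

    def writeback_fails(self, inode):
        # Asynchronous writeback hit an I/O error: bump the counter.
        self.cached[inode] = self.cached.get(inode, 0) + 1

    def evict(self, inode):
        # Nobody has the file open and no dirty buffers remain, so the
        # inode (and its counter) may be dropped from memory.
        self.cached.pop(inode, None)


def check_before_fsync(kernel, inode, saved_token):
    """Compare the saved token with the current one, as the fsync queue would."""
    return "PANIC" if kernel.error_token(inode) != saved_token else "ok"


def scenario(evict_after_error):
    kernel = ToyKernel()
    inode = 42
    saved = kernel.error_token(inode)  # stored in the fsync queue, file then closed
    kernel.writeback_fails(inode)      # error happens while we are not looking
    if evict_after_error:
        kernel.evict(inode)            # the counter is forgotten
    return check_before_fsync(kernel, inode, saved)
```

In this model scenario(False) reports "PANIC", which is the behaviour
we would want, while scenario(True) reports "ok" even though a
writeback error occurred, which is the silent loss described above.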

--
Thomas Munro
http://www.enterprisedb.com
