Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <craig(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-03-29 05:00:31
Message-ID: 20180329050031.GG28454@telsasoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Mar 29, 2018 at 11:30:59AM +0900, Michael Paquier wrote:
> On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:
> > Craig Ringer <craig(at)2ndquadrant(dot)com> writes:
> >> TL;DR: Pg should PANIC on fsync() EIO return.
> >
> > Surely you jest.
>
> Any callers of pg_fsync in the backend code are careful enough to check
> the returned status, sometimes doing retries like in mdsync, so what is
> proposed here would be a regression.

The retries are the source of the problem ; the first fsync() can return EIO,
and also *clears the error* causing a 2nd fsync (of the same data) to return
success.

(Note, I can see that it might be useful to PANIC on EIO but retry for ENOSPC).

On Thu, Mar 29, 2018 at 03:48:27PM +1300, Thomas Munro wrote:
> Craig, is the phenomenon you described the same as the second issue
> "Reporting writeback errors" discussed in this article?
> https://lwn.net/Articles/724307/

Worse, the article acknowledges the behavior without apparently suggesting to
change it:

"Storing that value in the file structure has an important benefit: it makes
it possible to report a writeback error EXACTLY ONCE TO EVERY PROCESS THAT
CALLS FSYNC() .... In current kernels, ONLY THE FIRST CALLER AFTER AN ERROR
OCCURS HAS A CHANCE OF SEEING THAT ERROR INFORMATION."

I believe I reproduced the problem behavior using dmsetup "error" target, see
attached.

strace looks like this:

kernel is Linux 4.10.0-28-generic #32~16.04.2-Ubuntu SMP Thu Jul 20 10:19:48 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

1 open("/dev/mapper/eio", O_RDWR|O_CREAT, 0600) = 3
2 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
3 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
4 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
5 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
6 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
7 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
8 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 2560
9 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = -1 ENOSPC (No space left on device)
10 dup(2) = 4
11 fcntl(4, F_GETFL) = 0x8402 (flags O_RDWR|O_APPEND|O_LARGEFILE)
12 brk(NULL) = 0x1299000
13 brk(0x12ba000) = 0x12ba000
14 fstat(4, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
15 write(4, "write(1): No space left on devic"..., 34write(1): No space left on device
16 ) = 34
17 close(4) = 0
18 fsync(3) = -1 EIO (Input/output error)
19 dup(2) = 4
20 fcntl(4, F_GETFL) = 0x8402 (flags O_RDWR|O_APPEND|O_LARGEFILE)
21 fstat(4, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
22 write(4, "fsync(1): Input/output error\n", 29fsync(1): Input/output error
23 ) = 29
24 close(4) = 0
25 close(3) = 0
26 open("/dev/mapper/eio", O_RDWR|O_CREAT, 0600) = 3
27 fsync(3) = 0
28 write(3, "\0", 1) = 1
29 fsync(3) = 0
30 exit_group(0) = ?

2: EIO isn't seen initially due to writeback page cache;
9: ENOSPC due to small device
18: original IO error reported by fsync, good
25: the original FD is closed
26: ..and file reopened
27: fsync on file with still-dirty data+EIO returns success BAD

10, 19: I'm not sure why there's dup(2), I guess glibc thinks that perror
should write to a separate FD (?)

Also note, close() ALSO returned success..which you might think exonerates the
2nd fsync(), but I think may itself be problematic, no? In any case, the 2nd
byte certainly never got written to DM error, and the failure status was lost
following fsync().

I get the exact same behavior if I break after one write() loop, such as to
avoid ENOSPC.

Justin

Attachment Content-Type Size
eio.c text/x-csrc 1.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2018-03-29 05:06:22 Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Previous Message Kyotaro HORIGUCHI 2018-03-29 04:56:20 Re: PQHost() undefined behavior if connecting string contains both host and hostaddr types