Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Anthony Iliopoulos <ailiop(at)altatus(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-06 03:20:22
Message-ID: CAMsr+YGXcjheOijrUhqz7vtMe0C7uqx=6HpxTAuntaHnZ5K5uw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 6 April 2018 at 10:53, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
wrote:

> On Fri, Apr 6, 2018 at 1:27 PM, Craig Ringer <craig(at)2ndquadrant(dot)com>
> wrote:
> > On 6 April 2018 at 07:37, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
> wrote:
> >> Note: as I've brought up in another thread, it turns out that PG is not
> >> handling fsync errors correctly even when the OS _does_ do the right
> >> thing (discovered by testing on FreeBSD).
> >
> > Yikes. For other readers, the related thread for this is
> > https://www.postgresql.org/message-id/87y3i1ia4w.fsf@
> news-spur.riddles.org.uk
>
> Yeah. That's really embarrassing, especially after beating up on
> various operating systems all week. It's also an independent issue --
> let's keep that on the other thread and get it fixed.
>
> > I see the failed fync, then the same fd being fsync()d without error on
> the
> > next checkpoint, which succeeds.
> >
> > postgres 9602 [003] 72380.325817: syscalls:sys_enter_fsync: fd:
> > 0x00000005
> > postgres 9602 [003] 72380.325931: syscalls:sys_exit_fsync:
> > 0xfffffffffffffffb
> > ...
> > postgres 9602 [000] 72381.336767: syscalls:sys_enter_fsync: fd:
> > 0x00000005
> > postgres 9602 [000] 72381.336840: syscalls:sys_exit_fsync: 0x0
> >
> > ... and Pg continues merrily on its way without realising it lost data:
> >
> > [72379.834872] XFS (dm-0): writeback error on sector 118752
> > [72380.324707] XFS (dm-0): writeback error on sector 118688
> >
> > In this test I set things up so the checkpointer would see the first
> fsync()
> > error. But if I make checkpoints less frequent, the bgwriter aggressive,
> and
> > kernel dirty writeback aggressive, it should be possible to have the
> failure
> > go completely unobserved too. I'll try that next, because we've already
> > largely concluded that the solution to the issue above is to PANIC on
> > fsync() error. But if we don't see the error at all we're in trouble.
>
> I suppose you only see errors because the file descriptors linger open
> in the virtual file descriptor cache, which is a matter of luck
> depending on how many relation segment files you touched.

In this case I think it's because the kernel didn't get around to doing the
writeback before the eagerly forced checkpoint fsync()'d it. Or we didn't
even queue it for writeback from our own shared_buffers until just before
we fsync()'d it. After all, it's a contrived test case that tries to
reproduce the issue rapidly with big writes and frequent checkpoints.

So the checkpointer had the relation open to fsync() it, and it was the
checkpointer's fsync() that did writeback on the dirty page and noticed the
error.

If we the kernel had done the writeback before the checkpointer opened the
relation to fsync() it, we might not have seen the error at all - though as
you note this depends on the file descriptor cache. You can see the
silent-error behaviour in my standalone test case where I confirmed the
post-4.13 behaviour. (I'm on 4.14 here).

I can try to reproduce it with postgres too, but it not only requires
closing and reopening the FDs, it also requires forcing writeback before
opening the fd. To make it occur in a practical timeframe I have to make my
kernel writeback settings insanely aggressive and/or call sync() before
re-open()ing. I don't really think it's worth it, since I've confirmed the
behaviour already with the simpler test in standalone/ in the rest repo. To
try it yourself, clone

https://github.com/ringerc/scrapcode

and in the master branch

cd testcases/fsync-error-clear
less README
make REOPEN=reopen standalone-run

See
https://github.com/ringerc/scrapcode/blob/master/testcases/fsync-error-clear/standalone/fsync-error-clear.c#L118
.

I've pushed the postgres test to that repo too; "make postgres-run".

You'll need docker, and be warned, it's using privileged docker containers
and messing with dmsetup.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2018-04-06 03:27:16 Re: [HACKERS] Runtime Partition Pruning
Previous Message Peter Eisentraut 2018-04-06 03:08:34 Re: csv format for psql