fsync reliability

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: fsync reliability
Date: 2011-04-21 08:26:06
Message-ID: BANLkTinE_Syc3Fh+-F2LhSuiZktMHehBfA@mail.gmail.com
Lists: pgsql-hackers

Daniel Farina points out to me that the Linux man page for fsync() says
"Calling fsync() does not necessarily ensure that the entry in the
directory containing the file has also reached disk. For that an explicit
fsync() on a file descriptor for the directory is also needed."
http://www.kernel.org/doc/man-pages/online/pages/man2/fsync.2.html

That sentence does not appear in the Open Group specification of fsync():
http://pubs.opengroup.org/onlinepubs/007908799/xsh/fsync.html

This point appears to have been discussed before
http://postgresql.1045698.n5.nabble.com/ALTER-DATABASE-SET-TABLESPACE-vs-crash-safety-td1995703.html

Tom said
"We don't try to "fsync the directory" after a normal table create for
instance"

which is fine because we don't need to. In the event of a crash a
missing table would be recreated during crash recovery.

However, that raises the question of what happens with WAL. At present,
we do nothing to ensure that "the entry in the directory containing
the file has also reached disk".

ISTM that we can easily do this, since we preallocate WAL files during
RemoveOldXlogFiles() and rarely extend the number of files.
So it seems easily possible to fsync the pg_xlog directory at the end
of RemoveOldXlogFiles(), which is mostly performed by the bgwriter
anyway.
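
To make the idea concrete, here is a rough sketch of what "fsync the
directory" means at the system call level. It is not a patch and not
existing PostgreSQL code: fsync_directory() and create_durable_file() are
hypothetical names I made up for illustration, and I am assuming a platform
(such as Linux) where a directory can be opened read-only and passed to
fsync().

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper, not a PostgreSQL function: ask the kernel to flush
 * the directory itself, so that entries created or renamed inside it are
 * on stable storage. Returns 0 on success, -1 on failure with errno set.
 */
static int
fsync_directory(const char *dirpath)
{
    int fd = open(dirpath, O_RDONLY);

    if (fd < 0)
        return -1;

    if (fsync(fd) != 0)
    {
        int save_errno = errno;     /* close() may overwrite errno */

        close(fd);
        errno = save_errno;
        return -1;
    }

    return close(fd);
}

/*
 * Hypothetical helper: create an empty file, fsync the file, and then
 * fsync the containing directory, as the Linux man page describes.
 * Fails if the file already exists.
 */
static int
create_durable_file(const char *dirpath, const char *name)
{
    char path[4096];
    int  fd;
    int  len = snprintf(path, sizeof(path), "%s/%s", dirpath, name);

    if (len < 0 || len >= (int) sizeof(path))
        return -1;                  /* path would be truncated */

    fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;

    if (fsync(fd) != 0)             /* flush the file itself */
    {
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;

    return fsync_directory(dirpath);    /* flush the directory entry */
}
```

For the case above, the equivalent would be a single fsync_directory()
style call on pg_xlog once the batch of file operations is finished,
rather than one call per file.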

It was also noted that "we've always expected the filesystem to take
care of its own metadata", which isn't actually stated anywhere in the
docs, AFAIK.

Perhaps this is an irrelevant problem these days, but would it hurt to fix?

Happy to do the patch if we agree.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
