Postgres, fsync, and OSs (specifically linux)

From: Andres Freund <andres(at)anarazel(dot)de>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Postgres, fsync, and OSs (specifically linux)
Date: 2018-04-27 22:28:42
Lists: pgsql-hackers


I thought I'd send this separately from [0] as the issue has become more
general than what was mentioned in that thread, and it went off into
various weeds.

I went to LSF/MM 2018 to discuss [0] and related issues. Overall I'd say
it was a very productive discussion. I'll first try to recap the
current situation, updated with knowledge I gained. Secondly I'll try to
discuss the kernel changes that seem to have been agreed upon. Thirdly
I'll try to sum up what postgres needs to change.

== Current Situation ==

The fundamental problem is that postgres assumed that any IO error would
be reported at fsync time, and that the error would be reported until
resolved. That's not true in several operating systems, linux included.

There's various judgement calls leading to the current OS (specifically
linux, but the concerns are similar in other OSs) behaviour:

- By the time IO errors are treated as fatal, it's unlikely that plain
retries attempting to write exactly the same data are going to
succeed. There are retries on several layers. Some cases would be
resolved by overwriting a larger amount (so device level remapping
functionality can mask dead areas), but plain retries aren't going to
get there if they didn't the first time round.
- Retaining all the data necessary for retries would make it quite
possible to turn IO errors on some device into out of memory
errors. This is true to a far lesser degree if only enough information
were to be retained to (re-)report an error, rather than actually
retry the write.
- Continuing to re-report an error after one fsync() failed would make
it hard to recover from that fact. There'd need to be a way to "clear"
a persistent error bit, and that'd obviously be outside of posix.
- Some other databases use direct-IO and thus these paths haven't been
exercised under fire that much.
- Actually marking files as persistently failed would require filesystem
changes, and filesystem metadata IO is itself far from guaranteed to
succeed in failure scenarios.

Before linux v4.13 errors in kernel writeback would be reported at most
once, without a guarantee that that'd happen (IIUC memory pressure could
lead to the relevant information being evicted) - but it was pretty
likely. Since v4.13, errors are reported exactly once to every file
descriptor that was open for the file at the time of the error - but
never to file descriptors opened after the error occurred.

It's worth noting that on linux it's not well defined what contents one
would read after a writeback error. IIUC xfs will mark the pagecache
contents that triggered an error as invalid, triggering a re-read from
the underlying storage (thus either failing or returning old but
persistent contents). Whereas some other filesystems (among them ext4 I
believe) retain the modified contents of the page cache, but mark them
as clean (thereby returning new contents until the page cache contents
are evicted).

Some filesystems (prominently NFS in many configurations) perform an
implicit fsync when closing the file. While postgres checks for an error
of close() and reports it, we don't treat it as fatal. It's worth noting
that by my reading this means that a flush error at close() will
*not* be re-reported by the time an explicit fsync() is issued. It also
means that we'll not react properly to the possible ENOSPC errors that
may be reported at close() for NFS. At least the latter isn't just the
case in linux.

Proposals for how postgres could deal with this included using syncfs(2)
- but that turns out not to work at all currently, because syncfs()
basically wouldn't return any file-level errors. It'd also imply
superfluously flushing temporary files etc.

The second major type of proposal was using direct-IO. That'd generally
be a desirable feature, but a) would require some significant changes to
postgres to be performant, b) isn't really applicable for the large
percentage of installations that aren't tuned reasonably well, because
at the moment the OS page cache functions as a memory-pressure aware
extension of postgres' page cache.

Another topic brought up in this thread was the handling of ENOSPC
errors that aren't triggered on a filesystem level, but rather are
triggered by thin provisioning. On linux that currently apparently leads
to page cache contents being lost (and errors "eaten") in a lot of
places, including just when doing a write(). In a lot of cases it's
pretty much expected that the file system will just hang or react
unpredictably upon space exhaustion. My reading is that the block-layer
thin provisioning code is still pretty fresh, and should only be used
with great care. The only way to use it halfway reliably appears to be
to change the configuration so that space exhaustion blocks until admin
intervention (at least dm-thinp allows that).

There's some clear need to automate some more testing in this area so
that future behaviour changes don't surprise us.

== Proposed Linux Changes ==

- Matthew Wilcox proposed (and posted a patch) that'd partially revert
behaviour to the pre v4.13 world, by *also* reporting errors to
"newer" file-descriptors if the error hasn't previously been
reported. That'd still not guarantee that the error is reported
(memory pressure could evict the information when there's no open fd),
but in most
situations we'll again get the error in the checkpointer.

This seems to be largely agreed upon. It's unclear whether it'll go into
the stable backports for still-maintained >= v4.13 kernels.

- syncfs() will be fixed so it reports errors properly - that'll likely
require passing it an O_PATH file descriptor to have space to store the
errseq_t value that allows discerning already reported and new errors.

No patch has appeared yet, but the behaviour seems largely agreed upon.

- Make per-filesystem error counts available in a uniform (i.e. same for
every supporting fs) manner. Right now it's very hard to figure out
whether errors occurred. There seemed general agreement that exporting
knowledge about such errors is desirable. Quite possibly the syncfs()
fix above will provide the necessary infrastructure. It's unclear as
of yet how the value would be exposed. Per-fs /sys/ entries and an
ioctl on O_PATH fds have been mentioned.

These error counts would not vanish due to memory pressure, and they
can be checked even without knowing which files in a specific
filesystem have been touched (e.g. when just untar-ing something).

There seemed to be fairly widespread agreement that this'd be a good
idea. It's much less clear whether somebody will do the work.

- Provide config knobs that allow defining the FS error behaviour in a
consistent way across supported filesystems. XFS currently has various
knobs controlling what happens in case of metadata errors [1] (retry
forever, timeout, return up). It was proposed that this interface be
extended to also deal with data errors, and moved into generic support
code.

While the timeline is unclear, there seemed to be widespread support
for the idea. I believe Dave Chinner indicated that he at least has
plans to generalize the code.

- Stop inodes with unreported errors from being evicted. This will
guarantee that a later fsync (without an open FD) will see the
error. The memory pressure concerns here are lower than with keeping
all the failed pages in memory, and it could be optimized further.

I read some tentative agreement behind this idea, but I think it's by
far the most controversial one.

== Potential Postgres Changes ==

Several operating systems / file systems behave differently (See
e.g. [2], thanks Thomas) than we expected. Even the discussed changes to
e.g. linux don't get to where we thought we are. There's obviously also
the question of how to deal with kernels / OSs that have not been
updated.

Changes that appear to be necessary, even for kernels with the issues
fixed:

- Clearly we need to treat fsync() EIO, ENOSPC errors as a PANIC and
retry recovery. While ENODEV (underlying device went away) will be
persistent, it probably makes sense to treat it the same or even just
give up and shut down. One question I see here is whether we just
want to continue crash-recovery cycles, or whether we want to limit them
in some way.

- We need more aggressive error checking on close(), for ENOSPC and
EIO. In both cases afaics we'll have to trigger a crash recovery
cycle. It's entirely possible to end up in a loop on NFS etc, but I
don't think there's a way around that.

Robert, on IM, wondered whether there'd be a race between some backend
doing a close(), triggering a PANIC, and a checkpoint succeeding. I
don't *think* so, because the error will only happen if there's
outstanding dirty data, and the checkpoint would have flushed that out
if it belonged to the current checkpointing cycle.

- The outstanding fsync request queue isn't persisted properly [3]. This
means that even if the kernel behaved the way we'd expected, we'd not
fail a second checkpoint :(. It's possible that we don't need to deal
with this because we'll henceforth PANIC, but I'd argue we should fix
that regardless. Seems like a time-bomb otherwise (e.g. after moving
to DIO somebody might want to relax the PANIC...).

- It might be a good idea to whitelist expected return codes for write()
and PANIC on ones that we did not expect. E.g. when hitting an EIO we
should probably PANIC, to get back to a known good state. Even though
it's likely that we'd see that error again at fsync().

- Docs.

I think we also need to audit a few codepaths. I'd be surprised if we
PANICed appropriately on all fsync()s, particularly around the SLRUs. I
think we need to be particularly careful around the WAL handling; it's
fairly likely that there are cases where we'd write out WAL in
one backend and then fsync() in another backend with a file descriptor
that has only been opened *after* the write occurred, which means we
might miss the error entirely.

Then there's the question of how we want to deal with kernels that
haven't been updated with the aforementioned changes. We could say that
we expect decent OS support and declare that we just can't handle this -
given that at least various linux versions, netbsd, openbsd, and MacOS
just silently drop errors, and we'd need different approaches for
dealing with each of them, that doesn't seem like an insane approach.

What we could do:

- forward file descriptors from backends to checkpointer (using
SCM_RIGHTS) when marking a segment dirty. That'd require some
optimizations (see [4]) to avoid doing so repeatedly. That'd
guarantee correct behaviour in all linux kernels >= 4.13 (possibly
backported by distributions?), and I think it'd also make it vastly
more likely that errors are reported in earlier kernels.

This should be doable without a noticeable performance impact, I
believe. I don't think it'd be that hard either, but it'd be a bit of
a pain to backport it to all postgres versions, as well as a bit
invasive for that.

The infrastructure this'd likely end up building (hashtable of open
relfilenodes), would likely be useful for further things (like caching
file size).

- Add a pre-checkpoint hook that checks for filesystem errors *after*
fsyncing all the files, but *before* logging the checkpoint completion
record. Operating systems, filesystems, etc. all log errors in different
formats, but for larger installations it'd not be too hard to
write code that checks their specific configuration.

While I'm a bit concerned about adding user code before a checkpoint, if
we'd do it as a shell command it seems pretty reasonable. And useful
even without concern for the fsync issue itself. Checking for IO
errors could e.g. also include checking for read errors - it'd not be
unreasonable to not want to complete a checkpoint if there'd been any
media errors.

- Use direct IO. Due to architectural performance issues in PG and the
fact that it'd not be applicable for all installations I don't think
this is a reasonable fix for the issue presented here. Although it's
independently something we should work on. It might be worthwhile to
provide a configuration option that allows forcing DIO to be enabled for
WAL even if replication is turned on.

- magic


Andres Freund


static const struct xfs_error_init xfs_error_meta_init[XFS_ERR_ERRNO_MAX] = {
	{ .name = "default",
	  .max_retries = XFS_ERR_RETRY_FOREVER,
	  .retry_timeout = XFS_ERR_RETRY_FOREVER,
	},
	{ .name = "EIO",
	  .max_retries = XFS_ERR_RETRY_FOREVER,
	  .retry_timeout = XFS_ERR_RETRY_FOREVER,
	},
	{ .name = "ENOSPC",
	  .max_retries = XFS_ERR_RETRY_FOREVER,
	  .retry_timeout = XFS_ERR_RETRY_FOREVER,
	},
	{ .name = "ENODEV",
	  .max_retries = 0,	/* We can't recover from devices disappearing */
	  .retry_timeout = 0,
	},
};


