Re: Postgres, fsync, and OSs (specifically linux)

From: Andres Freund <andres(at)anarazel(dot)de>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Postgres, fsync, and OSs (specifically linux)
Date: 2018-05-18 21:03:57
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On 2018-04-27 15:28:42 -0700, Andres Freund wrote:
> == Potential Postgres Changes ==
> Several operating systems / file systems behave differently (See
> e.g. [2], thanks Thomas) than we expected. Even the discussed changes to
> e.g. linux don't get to where we thought we are. There's obviously also
> the question of how to deal with kernels / OSs that have not been
> updated.
> Changes that appear to be necessary, even for kernels with the issues
> addressed:
> - Clearly we need to treat fsync() EIO, ENOSPC errors as a PANIC and
> retry recovery. While ENODEV (underlying device went away) will be
> persistent, it probably makes sense to treat it the same or even just
> give up and shut down. One question I see here is whether we just
> want to continue crash-recovery cycles, or whether we want to limit
> that.

Craig has a patch for this, although I'm not yet 100% happy with it.

> - We need more aggressive error checking on close(), for ENOSPC and
> EIO. In both cases afaics we'll have to trigger a crash recovery
> cycle. It's entirely possible to end up in a loop on NFS etc, but I
> don't think there's a way around that.

This needs to be handled.

> - The outstanding fsync request queue isn't persisted properly [3]. This
> means that even if the kernel behaved the way we'd expected, we'd not
> fail a second checkpoint :(. It's possible that we don't need to deal
> with this because we'll henceforth PANIC, but I'd argue we should fix
> that regardless. Seems like a time-bomb otherwise (e.g. after moving
> to DIO somebody might want to relax the PANIC...).

> What we could do:
> - forward file descriptors from backends to checkpointer (using
> SCM_RIGHTS) when marking a segment dirty. That'd require some
> optimizations (see [4]) to avoid doing so repeatedly. That'd
> guarantee correct behaviour in all linux kernels >= 4.13 (possibly
> backported by distributions?), and I think it'd also make it vastly
> more likely that errors are reported in earlier kernels.
> This should be doable without a noticeable performance impact, I
> believe. I don't think it'd be that hard either, but it'd be a bit of
> a pain to backport it to all postgres versions, as well as a bit
> invasive for that.
> The infrastructure this'd likely end up building (hashtable of open
> relfilenodes), would likely be useful for further things (like caching
> file size).

I've written a patch series for this. Took me quite a bit longer than I
had hoped.

The attached patchseries consists out of a few preparatory patches:
- freespace optimization to not call smgrexists() unnecessarily
- register_dirty_segment() optimization to not queue requests for
segments that locally are known to already have been dirtied. This
seems like a good optimization regardless of further changes. Doesn't
yet deal with the mdsync counter wrapping around (which is unlikely to
ever happen in practice, it's 32bit).
- some fd.c changes, I don't think they're quite right yet
- new functions to send/recv data over a unix domain socket, *including*
a file descriptor.

The main patch guarantees that fsync requests are forwarded from
backends to the checkpointer, including the file descriptor. As we do so
immediately at mdwrite() time, that guarantees that the fd has been open
from before the write started, therefore linux will guarantee that that
FD will see errors.

The design of the patch went through a few iterations. I initially
attempted to make the fsync request hashtable shared, but that turned
out to be a lot harder to do reliably *and* fast than I was anticipating
(we'd need to hold a lock for the entirety of mdsync(), dynahash doesn't
allow iteration while other backends modify).

So what I instead did was to replace the shared memory fsync request
queue with a unix domain socket (created in postmaster, using
socketpair()). CheckpointerRequest structs are written to that queue,
including the associated file descriptor. The checkpointer absorbs
those requests, and updates the local pending requests hashtable in
local process memory. To facilitate that mdsync() has read all requests
from the last cycle, checkpointer self-enqueues a token, which allows
to detect the end of the relevant portion of the queue.

The biggest complication in all this scheme is that checkpointer now
needs to keep a file descriptor open for every segment. That obviously
requires adding a few new fields to the hashtable entry. But the bigger
issue is that it's now possible that pending requests need to be
processed earlier than the next checkpoint, because of file descriptor
limits. To address that absorbing the fsync request queue will now do a
mdsync() style pass, doing the necessary fsyncs.

Because mdsync() (or rather its new workhorse mdsyncpass()) will now not
open files itself, there's no need to do deal with retries for files
that have been deleted. For the cases where we didn't yet receive a
fsync cancel request, we'll just fsync the fd. That's unnecessary, but

Obviously this is currently heavily unix specific (according to my
research all our unix platforms support say that they support sending
fds across unix domain sockets w/ SCM_RIGHTS). It's unclear whether any
OS but linux benefits from not closing file descriptors before fsync().

We could make this work for windows, without *too* much trouble (one can
just open fds in another process, using that process' handle).

I think there's some advantage in using the same approach
everywhere. For one not maintaining two radically different approaches
for complicated code. It'd also allow us to offload more fsyncs to
checkpointer, not just the ones for normal relation files, which does
seem advantageous. Not having ugly retry logic around deleted files in
mdsync() also seems nice. But there's cases where this is likely
slower, due to the potential of having to wait for checkpointer when the
queue is full.

I'll note that I think the new mdsync() is considerably simpler. Even if
we do not decide to use an approach as presented here, I think we should
make some of those changes. Specifically not unlinking the pending
requests bitmap in mdsync() seems like it both resolves existing bug
(see upthread) and makes the code simpler.

I plan to switch to working on something else for a day or two next
week, and then polish this further. I'd greatly appreciate comments till

I didn't want to do this now, but I think we should also consider
removing all awareness of segments from the fsync request queue. Instead
it should deal with individual files, and the segmentation should be
handled by md.c. That'll allow us to move all the necessary code to
smgr.c (or checkpointer?); Thomas said that'd be helpful for further
work. I personally think it'd be a lot simpler, because having to have
long bitmaps with only the last bit set for large append only relations
isn't a particularly sensible approach imo. The only thing that that'd
make more complicated is that the file/database unlink requests get more
expensive (as they'd likely need to search the whole table), but that
seems like a sensible tradeoff. Alternatively using a tree structure
would be an alternative obviously. Personally I was thinking that we
should just make the hashtable be over a pathname, that seems most


Andres Freund

Attachment Content-Type Size
v1-0001-freespace-Don-t-constantly-close-files-when-readi.patch text/x-diff 1.6 KB
v1-0002-Add-functions-to-send-receive-data-FD-over-a-unix.patch text/x-diff 3.5 KB
v1-0003-Make-FileGetRawDesc-ensure-there-s-an-associated-.patch text/x-diff 763 bytes
v1-0004-WIP-Allow-to-create-a-transient-file-for-a-previo.patch text/x-diff 2.5 KB
v1-0005-WIP-Allow-more-transient-files-and-allow-to-query.patch text/x-diff 2.1 KB
v1-0006-WIP-Optimize-register_dirty_segment-to-not-repeat.patch text/x-diff 10.0 KB
v1-0007-Heavily-WIP-Send-file-descriptors-to-checkpointer.patch text/x-diff 45.3 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2018-05-18 22:34:03 Re: inconsistency and inefficiency in setup_conversion()
Previous Message Robert Haas 2018-05-18 20:25:22 Re: Incorrect comment in get_partition_dispatch_recurse