Fsync-before-close thought experiment

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Fsync-before-close thought experiment
Date: 2019-03-03 23:30:31
Message-ID: CA+hUKGLMPXMnSLDwgnNRFPyxvf_0bJ5HwXcHWjCp7Cvh7G=xEA@mail.gmail.com
Lists: pgsql-hackers

Hello,

We corrected our previously held belief that it's safe to retry
fsync(). We still haven't dealt with the possibility that some
kernels can forget about write-back errors due to inode cache
pressure, if there is a time when no process has the file open. To
prevent that we need to hold dirty files open, somehow, until fsync()
is called.

The leading idea so far is a scheme to keep the fd that performed the
oldest write around by passing it through a Unix domain socket to the
checkpointer. Among the problems with that are (1) there's an as-yet
unresolved deadlock problem, but let's call that just a SMoP (a simple
matter of programming), (2) it
relies on atomic Unix socket messages without any guarantee from the
OS, (3) it relies on being able to keep fds in the Unix domain socket
(not counted by PostgreSQL), admittedly with socket buffer size as
back pressure, (4) it relies on the somewhat obscure and badly
documented fd-passing support to work the way we think it works and
without bugs on every OS, (5) it's too complicated to back-patch. In
this thread I'd like to discuss a simpler alternative.

A more obvious approach that probably moves us closer to the way
kernel developers expect us to write programs is to call fsync()
before close() (due to vfd pressure) if you've written. The obvious
problem with that is that you could finish up doing loads more
fsyncing than we're doing today if you're regularly dirtying more than
max_files_per_process in the same backend. I wonder if we could make
it more acceptable like so:

* when performing writes, record the checkpointer's cycle counter in the File

* when closing due to vfd pressure, only call fsync() if the cycle
hasn't advanced (that is, you get to skip the fsync() if you haven't
written in this sync cycle, because the checkpointer must have taken
care of your writes from the previous cycle)

* if you find you're doing this too much (by default, dirtying more
than 1k relation segments per checkpoint cycle), maybe log a warning
that the user might want to consider increasing max_files_per_process
(and related OS limits)
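The rule in the bullets above could look something like this. This is
only a sketch; the struct fields and function names here are invented
for illustration and don't match PostgreSQL's actual fd.c:

```c
#include <assert.h>
#include <stdbool.h>

/* Invented names: a minimal model of a vfd carrying the sync cycle
 * at which it was last written. */
typedef struct File
{
    unsigned    write_cycle;    /* sync cycle at last write */
    bool        dirty;          /* written since last fsync */
} File;

static unsigned current_sync_cycle;     /* advanced by the checkpointer */

/* When performing a write, record the current cycle in the File. */
static void
FileRecordWrite(File *f)
{
    f->write_cycle = current_sync_cycle;
    f->dirty = true;
}

/* When closing due to vfd pressure, fsync only if we wrote in the
 * current cycle; if the cycle has advanced since our last write, the
 * checkpointer must already have taken care of it. */
static bool
FileNeedsFsyncBeforeClose(const File *f)
{
    return f->dirty && f->write_cycle == current_sync_cycle;
}
```

The attraction is that a backend that merely holds old dirty files
open, without writing to them again this cycle, pays nothing on close.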

A possible improvement, stolen from the fd-passing patch, is to
introduce a "sync cycle", separate from checkpoint cycle, so that the
checkpointer can process its table of pending operations more than
once per checkpoint if that would help minimise fsyncing in foreground
processes. I haven't thought much about how exactly that would work.

There is a less obvious problem with this scheme, though:

1. Suppose the operating system has a single error flag for a file no
matter how many fds have it open, and it is cleared by the first
syscall to return EIO. There is a broken schedule like this (B =
regular backend, C = checkpointer):

B: fsync() -> EIO # clears error flag
C: fsync() -> success, log checkpoint
B: PANIC!

2. On an operating system that has an error counter + seen flag
(Linux 4.14+), in general you receive the error on all fds, which is a
very nice property, but we'd still have a broken schedule involving
the seen flag:

B: fsync() -> EIO # clears seen flag
C: open() # opened after error
C: fsync() -> success, log checkpoint
B: PANIC!
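For concreteness, here is a toy model of the counter + seen-flag
behaviour described above: each inode carries an error counter, and
each fd samples it at open time, so an fd opened after the error never
reports it. This is not the kernel's implementation (Linux uses
errseq_t internally); all names are invented:

```c
#include <assert.h>
#include <stdbool.h>

typedef struct Inode
{
    unsigned    err_counter;    /* bumped on each write-back error */
} Inode;

typedef struct Fd
{
    Inode      *inode;
    unsigned    sampled;        /* counter value last observed */
} Fd;

static Fd
fd_open(Inode *inode)
{
    /* An fd samples the counter at open, so earlier errors are invisible. */
    Fd          fd = {inode, inode->err_counter};

    return fd;
}

static void
inode_writeback_error(Inode *inode)
{
    inode->err_counter++;
}

/* Returns true on success, false to model EIO. */
static bool
fd_fsync(Fd *fd)
{
    if (fd->sampled != fd->inode->err_counter)
    {
        fd->sampled = fd->inode->err_counter;   /* error now "seen" */
        return false;                           /* EIO */
    }
    return true;
}
```

The broken schedule falls straight out: B's fd sees the error, but a
checkpointer fd opened after the error fsyncs successfully.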

Here's one kind of interlocking that might work: Hash pathnames and
map to an array of lwlocks + error flags. Any process trying to sync
a file must hold the lock and check for a pre-existing error flag.
Now a checkpoint cannot succeed if any backend has recently decided to
panic. You could skip that if data_sync_retry = on.
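A minimal sketch of that interlock might look like the following. The
slot array size, hash function and names are all invented; a real
version would hold the slot's lwlock across the fsync() call rather
than just consult a flag:

```c
#include <assert.h>
#include <stdbool.h>

#define NSYNCSLOTS 128

typedef struct SyncSlot
{
    /* LWLock lock; -- held around fsync() in a real implementation */
    bool        error_seen;     /* some process got EIO on a file here */
} SyncSlot;

static SyncSlot sync_slots[NSYNCSLOTS];

/* Map a pathname to a slot (djb2-style hash, for illustration only). */
static unsigned
pathname_hash(const char *path)
{
    unsigned    h = 5381;

    for (; *path; path++)
        h = h * 33 + (unsigned char) *path;
    return h % NSYNCSLOTS;
}

/*
 * Check the slot's error flag before trusting an fsync(); if we got
 * EIO ourselves, set it, so that a later fsync() of the same file by
 * the checkpointer cannot falsely allow the checkpoint to succeed.
 * Returns true if the caller must not report success.
 */
static bool
sync_slot_check(const char *path, bool got_eio)
{
    SyncSlot   *slot = &sync_slots[pathname_hash(path)];

    if (got_eio)
        slot->error_seen = true;
    return slot->error_seen;
}
```

Because the flag is sticky, a checkpoint cannot complete once any
backend has recorded an error for a file it covers, closing the
schedules shown above (at the cost of false sharing between files that
hash to the same slot).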

--
Thomas Munro
https://enterprisedb.com
