Re: Fsync-before-close thought experiment

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fsync-before-close thought experiment
Date: 2019-03-05 14:58:50
Message-ID: CA+Tgmob1OotvHkQmjy=Gcw_oo80vTexgz+U6c7QPGVd=Q_JB5A@mail.gmail.com
Lists: pgsql-hackers

On Sun, Mar 3, 2019 at 6:31 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> A more obvious approach that probably moves us closer to the way
> kernel developers expect us to write programs is to call fsync()
> before close() (due to vfd pressure) if you've written.

Interesting....

> The obvious
> problem with that is that you could finish up doing loads more
> fsyncing than we're doing today if you're regularly dirtying more than
> max_files_per_process in the same backend.

Check.

> I wonder if we could make
> it more acceptable like so:
>
> * when performing writes, record the checkpointer's cycle counter in the File
>
> * when closing due to vfd pressure, only call fsync() if the cycle
> hasn't advanced (that is, you get to skip the fsync() if you haven't
> written in this sync cycle, because the checkpointer must have taken
> care of your writes from the previous cycle)

Hmm, OK.
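
Just to make sure I'm reading that right, the rule would look roughly
like the sketch below.  Everything in it -- SketchVfd,
shared_sync_cycle, the sketch_* functions -- is invented for
illustration; it's not the real fd.c or checkpointer bookkeeping.

#include <unistd.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for a vfd; the real thing lives in fd.c's Vfd array. */
typedef struct SketchVfd
{
    int         fd;             /* kernel descriptor, or -1 if closed */
    bool        dirtied;        /* have we written through this vfd? */
    uint64_t    dirty_cycle;    /* *shared_sync_cycle at the last write */
} SketchVfd;

/* the checkpointer's cycle counter, in shared memory */
extern volatile uint64_t *shared_sync_cycle;

/* Write path: stamp the File with the current sync cycle. */
static void
sketch_record_write(SketchVfd *vfd)
{
    vfd->dirtied = true;
    vfd->dirty_cycle = *shared_sync_cycle;
}

/*
 * Close due to vfd pressure.  Only fsync() if we've written in the
 * current sync cycle; if our last write was in an earlier cycle, the
 * checkpointer must have taken care of it, so the fsync() can be
 * skipped.
 */
static void
sketch_close_for_pressure(SketchVfd *vfd)
{
    if (vfd->dirtied && vfd->dirty_cycle == *shared_sync_cycle)
    {
        if (fsync(vfd->fd) != 0)
        {
            perror("fsync before close");   /* real code: elog(PANIC, ...) */
            abort();
        }
    }
    close(vfd->fd);
    vfd->fd = -1;
    vfd->dirtied = false;
}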

> * if you find you're doing this too much (by default, dirtying more
> than 1k relation segments per checkpoint cycle), maybe log a warning
> that the user might want to consider increasing max_files_per_process
> (and related OS limits)
>
> A possible improvement, stolen from the fd-passing patch, is to
> introduce a "sync cycle", separate from checkpoint cycle, so that the
> checkpointer can process its table of pending operations more than
> once per checkpoint if that would help minimise fsyncing in foreground
> processes. I haven't thought much about how exactly that would work.

Yeah, that seems worth considering. I suppose that a backend could
keep track of how many still-open Files it has stamped with the current
sync cycle -- this seems like it should be pretty simple and cheap,
provided we can find the right places to put the counter adjustments.
If that number gets too big, say greater than 80% of the number of fds
it's allowed to keep open, it sends a ping to the checkpointer. I'm not
sure if that would then immediately trigger a full sync cycle or if
there is something more granular we could do.
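
In sketch form, reusing the invented names from the sketch above
(sketch_ping_checkpointer is equally hypothetical):

extern void sketch_ping_checkpointer(void);

static uint64_t my_counted_cycle = 0;   /* cycle the count below refers to */
static int      open_files_dirty_this_cycle = 0;

/* Call from the write path, just before sketch_record_write(). */
static void
sketch_maybe_ping_checkpointer(SketchVfd *vfd, int max_open_files)
{
    uint64_t    cycle = *shared_sync_cycle;

    if (cycle != my_counted_cycle)
    {
        /* a new sync cycle has started; the old count no longer matters */
        my_counted_cycle = cycle;
        open_files_dirty_this_cycle = 0;
    }

    /*
     * Count each still-open File only the first time it is stamped with
     * this cycle (closing such a File would decrement the count; omitted
     * here).
     */
    if (!vfd->dirtied || vfd->dirty_cycle != cycle)
        open_files_dirty_this_cycle++;

    /*
     * If most of the descriptors we're allowed to keep open now hold
     * writes from the current cycle, closing them for pressure would mean
     * foreground fsyncs, so ask the checkpointer to start a new sync
     * cycle soon.
     */
    if (open_files_dirty_this_cycle > (max_open_files * 8) / 10)
        sketch_ping_checkpointer();     /* hypothetical signalling call */
}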

> There is a less obvious problem with this scheme though:
>
> 1. Suppose the operating system has a single error flag for a file no
> matter how many fds have it open, and it is cleared by the first
> syscall to return EIO. There is a broken schedule like this (B =
> regular backend, C = checkpointer):
>
> B: fsync() -> EIO # clears error flag
> C: fsync() -> success, log checkpoint
> B: PANIC!
>
> 2. On an operating system that has an error counter + seen flag
> (Linux 4.14+), in general you receive the error in all fds, which is a
> very nice property, but we'd still have a broken schedule involving
> the seen flag:
>
> B: fsync() -> EIO # clears seen flag
> C: open() # opened after error
> C: fsync() -> success, log checkpoint
> B: PANIC!
>
> Here's one kind of interlocking that might work: Hash pathnames and
> map to an array of lwlocks + error flags. Any process trying to sync
> a file must hold the lock and check for a pre-existing error flag.
> Now a checkpoint cannot succeed if any backend has recently decided to
> panic. You could skip that if data_sync_retry = on.
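
If I'm reading that right, it amounts to roughly the following.  The
slot array, its size, and all the names are invented for the sketch;
the real thing would size and place this properly in shared memory.

#include "postgres.h"

#include <errno.h>
#include <unistd.h>

#include "storage/lwlock.h"

#define NUM_SYNC_ERROR_SLOTS 128        /* arbitrary for the sketch */

typedef struct SyncErrorSlot
{
    LWLock      lock;
    bool        error_seen;     /* an fsync of a file mapping here failed */
} SyncErrorSlot;

/* lives in shared memory, locks initialized at startup */
extern SyncErrorSlot *sync_error_slots;

static uint32
sketch_sync_slot_for(const char *path)
{
    uint32      h = 0;

    /* any stable string hash keyed by pathname will do for the sketch */
    for (; *path; path++)
        h = h * 31 + (unsigned char) *path;
    return h % NUM_SYNC_ERROR_SLOTS;
}

/*
 * Returns 0 on success, -1 if this fsync -- or an earlier one that mapped
 * to the same slot -- failed.  Used by backends and checkpointer alike.
 */
static int
sketch_locked_fsync(int fd, const char *path)
{
    SyncErrorSlot *slot = &sync_error_slots[sketch_sync_slot_for(path)];
    int         rc;

    LWLockAcquire(&slot->lock, LW_EXCLUSIVE);
    if (slot->error_seen)
    {
        /* someone already hit EIO here; don't let a later "success" hide it */
        LWLockRelease(&slot->lock);
        errno = EIO;
        return -1;
    }
    rc = fsync(fd);
    if (rc != 0)
        slot->error_seen = true;        /* remember it for every later caller */
    LWLockRelease(&slot->lock);
    return rc;
}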

That might be fine, but I think it might be possible to create
something more light-weight. Suppose we just decide that foreground
processes always win and the checkpointer has to wait before logging
the checkpoint. To that end, a foreground process advertises in
shared memory whether or not it is currently performing an fsync.
After finishing its own fsyncs, and before logging the checkpoint, the
checkpointer must observe each such process in the not-fsyncing state
at least once.

If a process starts fsync-ing after being observed not-fsyncing, it
just means that the backend started doing an fsync() after the
checkpointer had completed the fsyncs for that checkpoint. And in
that case the checkpointer would already have observed any EIO
affecting writes made prior to the checkpoint, so it's OK to log that
checkpoint; it's only the next one that has an issue, and the fact that
we're now advertising that we are fsync()-ing again will prevent that
one from completing before we emit any necessary PANIC.
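
Here's a sketch of what I have in mind.  All the names are invented
(my_backend_id indexing a per-backend array in shared memory), and
pg_memory_barrier()/pg_usleep() are just standing in for whatever
fencing and waiting the real thing would use.

#include "postgres.h"

#include <signal.h>
#include <unistd.h>

#include "miscadmin.h"
#include "port/atomics.h"

typedef struct PerBackendSyncFlag
{
    /* set before a foreground fsync() of a data file, cleared afterwards */
    volatile sig_atomic_t fsync_in_progress;
} PerBackendSyncFlag;

extern PerBackendSyncFlag *backend_sync_flags;  /* one slot per backend */
extern int  max_backends;
extern int  my_backend_id;

/* Foreground side: advertise while fsyncing. */
static int
sketch_advertised_fsync(int fd)
{
    int         rc;

    backend_sync_flags[my_backend_id].fsync_in_progress = 1;
    pg_memory_barrier();
    rc = fsync(fd);
    if (rc != 0)
    {
        /*
         * We're about to PANIC; leave the flag set so the checkpointer
         * can't log a checkpoint in the window before we go down.
         */
        return rc;
    }
    pg_memory_barrier();
    backend_sync_flags[my_backend_id].fsync_in_progress = 0;
    return 0;
}

/*
 * Checkpointer side: after finishing its own fsyncs, and before logging
 * the checkpoint record, wait until every backend has been seen
 * not-fsyncing at least once.  A backend that starts a new fsync after
 * we've observed it idle started it after our own fsyncs finished, so
 * any error affecting pre-checkpoint writes would already have shown up
 * in ours; whatever it hits now is the next checkpoint's problem.
 */
static void
sketch_wait_for_foreground_fsyncs(void)
{
    for (int i = 0; i < max_backends; i++)
    {
        while (backend_sync_flags[i].fsync_in_progress)
            pg_usleep(1000L);   /* real code would use a latch, not polling */
    }
}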

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
