Re: Refactoring the checkpointer's fsync request queue

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Shawn Debnath <sdn(at)amazon(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Subject: Re: Refactoring the checkpointer's fsync request queue
Date: 2018-11-23 04:45:18
Lists: pgsql-hackers

On Sat, Nov 17, 2018 at 10:53 AM Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> On Thu, Nov 15, 2018 at 5:09 AM Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
> > While testing this patch with frequent checkpoints I've stumbled upon an
> > interesting error, that happened already after I finished one test:
> >
> > TRAP: FailedAssertion("!(rc > 0)", File: "checkpointer.c", Line: 574)

Fixed in the 0001 patch (and a similar problem in the WIN32 branch).

On Thu, Nov 15, 2018 at 10:37 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Nov 13, 2018 at 6:44 PM Thomas Munro
> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> > > That sounds a little like you are proposing to go back to the way
> > > things were before 806a2aee3791244bf0f916729bfdb5489936e068 (and,
> > > belatedly, bf405ba8e460051e715d0a91442b579e590328ce) although I guess
> > > the division of labor wouldn't be quite the same.
> >
> > But is there an argument against it? The checkpointer would still be
> > creating checkpoints including running fsync, but the background
> > writer would be, erm, writing, erm, in the background.
> I don't know. I guess the fact that the checkpointer is still
> performing the fsyncs is probably a key point. I mean, in the old
> division of labor, fsyncs could interrupt the background writing that
> was supposed to be happening.

Robert explained off-list that BgBufferSync() and BufferSync() have
rather different goals, and performing them in the same loop without
major reengineering to merge their logic would probably not work out
well. So I'm abandoning that plan for now (though it could perhaps be
interesting if done right).

I do have a new plan though...

On Wed, Nov 14, 2018 at 7:01 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> ... I've previously wondered whether
> there's any way we could delay the write to a point where the buffer is
> not locked anymore - as far as I can tell it's actually not required for
> correctness that we send the fsync request before unlocking. It's
> architecturally a bit dicey tho :(

... and it's basically what Andres said there ^^^.

The specific hazard I wondered about is when a checkpoint begins after
BufferAlloc() calls pwrite() but before it calls sendto(), so that we
fail to fsync() a file that was modified before the checkpoint LSN.
But, AFAICS, assuming we call sendto() before we update the buffer
header, there are only two possibilities from the point of view of
the checkpointer:

1. The checkpointer's BufferSync() loop arrives before we update the
buffer header, so it sees the buffer as dirty, writes it out (again),
remembers that the segment is dirty, and then when we eventually get
the buffer header lock we see that it's not dirty anymore and we just
skip the buffer.

2. The checkpointer's BufferSync() loop arrives after we updated the
buffer header, so it sees it as invalid (or some later state), which
means that we have already called sendto() (before we updated the
buffer header), so the request is already in the pipe.

Either way, the checkpointer finishes up calling fsync() before the
checkpoint completes, as it should, and the worst that can happen due
to bad timing is a harmless double pwrite().

I noticed a subtle problem though. Suppose we have case 2 above.
After BufferSync() returns in the checkpointer, our backend has called
sendto() to register a dirty file. In v2 the checkpointer runs
AbsorbAllFsyncRequests() to drain the pipe until it sees a message for
the current cycle (that is, it absorbs messages for the previous
cycle). That's obviously not good enough, since backends race to call
sendto() and a message for cycle n - 1 might be hiding behind a
message for cycle n. So I propose to drain the pipe until it is empty
or we see a message for cycle n + 1 (where n is the current cycle
before we start draining, meaning that we ran out of fds and forced a
new cycle in FlushFsyncRequestQueueIfNecessary()). I think that
works, because although we can't be sure that we'll receive all
messages for n - 1 before we receive a message for n due to races on
the insert side, we *can* be certain that we'll receive all messages
for n - 1 before we receive a message for n + 1, because we know that
they were already in the pipe before we began. In the happy case, our
system never runs out of fds so the pipe will eventually be empty,
since backends send at most one message per cycle per file and the
number of files is finite, and in the sad case, there are too many
dirty files per cycle, so we keep creating new cycles while absorbing,
but again the loop is bounded because we know that seeing n + 1 is
enough for our purpose (which is to fsync all files that were already
mentioned in messages sent to the pipe before we started our loop).

That's implemented in the 0002 patch, separated for ease of review.
The above theories cover BufferAlloc()'s interaction with a
checkpoint, and that seems to be the main story. I'm not entirely
sure about the other callers of FlushBuffer() or FlushOneBuffer() (eg
special case init fork stuff), but I've run out of brain power for now
and wanted to post an update.

Thomas Munro

Attachment Content-Type Size
0001-Refactor-the-checkpointer-s-data-sync-request-que-v3.patch application/octet-stream 114.2 KB
0002-Fix-deadlock-by-sending-without-content-lock-but--v3.patch application/octet-stream 9.7 KB
