Re: Refactoring the checkpointer's fsync request queue

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Shawn Debnath <sdn(at)amazon(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Subject: Re: Refactoring the checkpointer's fsync request queue
Date: 2018-11-13 23:43:59
Message-ID: CAEepm=1W-pVuKeSd93AbxdGuEY8OW01RjQy8xtMEwjPbAgkQcw@mail.gmail.com
Lists: pgsql-hackers

(Replies to a couple of different messages below)

On Wed, Nov 14, 2018 at 6:04 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Sun, Nov 11, 2018 at 9:59 PM Thomas Munro
> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> > There is one major problem with this patch
>
> If there's only one, you're doing great! Although admittedly this
> seems like a big one...

Make that two.

> > 2. Offload the BufferSync() work to bgwriter, so the checkpointer can
> > keep draining the pipe. Communication between checkpointer and
> > bgwriter can be fairly easily multiplexed with the pipe draining work.
>
> That sounds a little like you are proposing to go back to the way
> things were before 806a2aee3791244bf0f916729bfdb5489936e068 (and,
> belatedly, bf405ba8e460051e715d0a91442b579e590328ce) although I guess
> the division of labor wouldn't be quite the same.

But is there an argument against it? The checkpointer would still be
creating checkpoints including running fsync, but the background
writer would be, erm, writing, erm, in the background.
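
Very roughly, I'm imagining the checkpointer doing something like the
following while the bgwriter runs BufferSync() (a sketch only;
RequestBgwriterBufferSync(), BgwriterBufferSyncDone() and fsync_fd_pipe
are made-up placeholders, and I'm assuming AbsorbFsyncRequests() is the
thing that drains the pipe in this patch):

    /* Hand BufferSync() off to the bgwriter, then keep draining the
     * fd-passing pipe until it reports that it's done. */
    RequestBgwriterBufferSync();        /* hypothetical: shmem flag + latch */
    while (!BgwriterBufferSyncDone())   /* hypothetical: check shmem flag */
    {
        (void) WaitLatchOrSocket(MyLatch,
                                 WL_LATCH_SET | WL_SOCKET_READABLE | WL_TIMEOUT,
                                 fsync_fd_pipe, 100L,
                                 WAIT_EVENT_CHECKPOINTER_MAIN);
        ResetLatch(MyLatch);
        AbsorbFsyncRequests();          /* drain whatever arrived on the pipe */
        CHECK_FOR_INTERRUPTS();
    }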

Admittedly it adds a whole extra rabbit hole to this rabbit hole,
which was itself a diversion from my goal of refactoring the syncing
machinery to support undo logs. But the other two ideas seem to suck
and/or have correctness issues.

On Wed, Nov 14, 2018 at 7:43 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Nov 13, 2018 at 1:07 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > On 2018-11-13 12:04:23 -0500, Robert Haas wrote:
> > > I still feel like this whole pass-the-fds-to-the-checkpointer thing is
> > > a bit of a fool's errand, though. I mean, there's no guarantee that
> > > the first FD that gets passed to the checkpointer is the first one
> > > opened, or even the first one written, is there?
> > I'm not sure I understand the danger you're seeing here. It doesn't have
> > to be the first fd opened, it has to be an fd that's older than all the
> > writes that we need to ensure made it to disk. And that ought to be
> > guaranteed by the logic? Between the FileWrite() and the
> > register_dirty_segment() (and other relevant paths) the FD cannot be
> > closed.
>
> Suppose backend A and backend B open a segment around the same time.
> Is it possible that backend A does a write before backend B, but
> backend B's copy of the fd reaches the checkpointer before backend A's
> copy? If you send the FD to the checkpointer before writing anything
> then I think it's fine, but if you write first and then send the FD to
> the checkpointer I don't see what guarantees the ordering.

I'm not sure it matters whether we send the fd before or after the
write, but either way we need some kind of global ordering of fds that
can order a given fd with respect to writes in other processes. So the
patch introduces a global shared counter that is captured immediately
after open() (including when a file is reopened by the vfd machinery).
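
To be concrete, something along these lines (a minimal sketch;
SyncSharedState and the open_seq fields are names I've invented here,
not what the patch calls them):

    /* Shared, monotonically increasing counter, e.g. in shmem. */
    static inline uint64
    NextOpenSequence(void)
    {
        return pg_atomic_fetch_add_u64(&SyncSharedState->open_seq, 1);
    }

    /* fd.c, immediately after a segment is (re)opened and before any
     * write can be issued through this fd: */
    vfdP->fd = BasicOpenFilePerm(vfdP->fileName, vfdP->fileFlags, vfdP->fileMode);
    if (vfdP->fd >= 0)
        vfdP->open_seq = NextOpenSequence();    /* hypothetical new Vfd field */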

In your example, both fds arrive in the checkpointer in some order,
and it will keep the one with the older sequence number and close the
other one. This sorting of all interesting fds will be forced before
the checkpoint completes by AbsorbFsyncRequests(), which drains all
messages from the pipe until it sees a message for the next checkpoint
cycle.
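
In pseudocode, the rule when an fd arrives for a segment the
checkpointer already has one for is roughly this (the table and message
layout are invented for illustration):

    entry = hash_search(pendingFdTable, &msg->key, HASH_ENTER, &found);
    if (!found || msg->open_seq < entry->open_seq)
    {
        /* Keep the fd with the older open sequence: it must predate
         * every write that the newer fd could have been used for. */
        if (found)
            close(entry->fd);
        entry->fd = msg->fd;
        entry->open_seq = msg->open_seq;
    }
    else
        close(msg->fd);         /* newer duplicate, safe to drop */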

Hmm, I think there is a flaw in the plan here, though. Messages for
different checkpoint cycles race to enter the pipe around the time the
cycle counter is bumped, so a message for cycle n can hide behind a
message for n + 1 and the drain can stop too early (sketched below).
I'm not sure and need to look at something else today, but I see a
couple of potential solutions to mull over: either a new shared counter
increment, or a special pipe message written after BufferSync() by the
bgwriter (if we go for idea #2; Andres had something similar in the
original prototype, but it could self-deadlock). I need to figure out
whether that is a suitable barrier, given the buffer interlocking.
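
To show the race I mean, the naive drain loop looks something like this
(ReadFsyncRequest() and the message fields are placeholders):

    /* Drain until we see a message stamped with the next cycle. */
    for (;;)
    {
        ReadFsyncRequest(fsync_fd_pipe, &msg);      /* hypothetical */
        if (msg.ckpt_cycle > current_cycle)
            break;      /* WRONG: a backend that captured cycle n just
                         * before the bump may still be writing, so its
                         * message can land after this n + 1 message and
                         * never be absorbed for this checkpoint. */
        RememberFsyncRequest(msg.rnode, msg.forknum, msg.segno);
    }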

> > > It seems like if you wanted to make this work reliably, you'd need to
> > > do it the other way around: have the checkpointer (or some other
> > > background process) open all the FDs, and anybody else who wants to
> > > have one open get it from the checkpointer.
> >
> > That'd require a process context switch for each FD opened, which seems
> > clearly like a no-go?
>
> I don't know how bad that would be. But hey, no cost is too great to
> pay as a workaround for insane kernel semantics, right?

Yeah, that seems extremely expensive and unnecessary. It seems
sufficient to track the global opening order... or at least a proxy
that identifies the fd that performed the oldest write, which I believe
this patch is doing.

--
Thomas Munro
http://www.enterprisedb.com
