
From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Scaling shared buffer eviction
Date: 2014-09-11 13:22:26
Message-ID: 20140911132226.GA17294@awork2.anarazel.de
Lists: pgsql-hackers

On 2014-09-11 09:02:34 -0400, Robert Haas wrote:
> Thanks for reviewing, Andres.
>
> On Thu, Sep 11, 2014 at 7:01 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >> +static void bgreclaim_quickdie(SIGNAL_ARGS);
> >> +static void BgreclaimSigHupHandler(SIGNAL_ARGS);
> >> +static void ReqShutdownHandler(SIGNAL_ARGS);
> >> +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);
> >
> > This looks inconsistent.
>
> It's exactly the same as what bgwriter.c does.

So what? There's no code in common, so I see no reason for one signal
handler name to use underscores and the next one CamelCase.

> > No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a
> > good idea, regardless of it possibly being true today (which I'm not
> > sure about yet).
>
> We really need a more centralized way to handle error cleanup in
> auxiliary processes. The current state of affairs is really pretty
> helter-skelter.

Agreed. There really should be three variants:
* full abort including support for transactions
* full abort without transactions being used (most background processes)
* abort without shared memory interactions
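
To make that concrete, here's a hypothetical sketch of how the three
variants could be expressed (all names invented; nothing like this
exists in the tree today):

```c
/* Hypothetical sketch -- none of these names exist in the tree today. */
typedef enum AuxCleanupLevel
{
    AUX_CLEANUP_TRANSACTIONAL,  /* full abort, transactions in use */
    AUX_CLEANUP_SHMEM_ONLY,     /* full abort, no transactions (most bgprocs) */
    AUX_CLEANUP_LOCAL_ONLY      /* abort without touching shared memory */
} AuxCleanupLevel;

/* Cleanup steps a given level implies, as a bitmask. */
#define CLEANUP_LWLOCKS     (1 << 0)    /* LWLockReleaseAll() etc. */
#define CLEANUP_BUFFER_IO   (1 << 1)    /* AbortBufferIO() etc. */
#define CLEANUP_XACT        (1 << 2)    /* transaction abort */

static int
aux_cleanup_steps(AuxCleanupLevel level)
{
    switch (level)
    {
        case AUX_CLEANUP_TRANSACTIONAL:
            return CLEANUP_XACT | CLEANUP_LWLOCKS | CLEANUP_BUFFER_IO;
        case AUX_CLEANUP_SHMEM_ONLY:
            return CLEANUP_LWLOCKS | CLEANUP_BUFFER_IO;
        case AUX_CLEANUP_LOCAL_ONLY:
            return 0;
    }
    return 0;
}
```

That way each auxiliary process would pick a level instead of
hand-rolling its own sigsetjmp() cleanup block.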

> But for this patch, I think we should aim to mimic
> the existing style, as ugly as it is.

Agreed.

> Background Reclaimer's Processing
> ---------------------------------
>
> The background reclaimer runs the clock sweep to identify buffers that
> are good candidates for eviction and puts them on the freelist. This
> makes buffer allocation much faster, since removing a buffer from the
> head of a linked list is much cheaper than linearly scanning the whole
> buffer pool until a promising candidate is found. It's possible that
> a buffer we add to the freelist may be accessed or even pinned before
> it's evicted; if that happens, the backend that would have evicted it
> will simply disregard it and take the next buffer instead (or run the
> clock sweep itself, if necessary). However, to make sure that doesn't
> happen too often, we need to keep the freelist as short as possible,
> so that there won't be many other buffer accesses between the time a
> buffer is added to the freelist and the time when it's actually
> evicted.
>
> We use two water marks to control the activity of the bgreclaimer
> process. Each time bgreclaimer is awoken, it will move buffers to the
> freelist until the length of the free list reaches the high water
> mark. It will then sleep.
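
A standalone toy model of the behaviour described above -- run the
clock sweep, decrementing usage counts, until the freelist reaches the
high water mark (all names and sizes invented, not the patch's code):

```c
#define NBUFFERS        16
#define HIGH_WATER_MARK 5

typedef struct
{
    int refcount;       /* pin count */
    int usage_count;    /* clock-sweep usage count */
} ToyBuf;

static ToyBuf pool[NBUFFERS];       /* toy buffer pool, zero-initialised */
static int freelist[NBUFFERS];      /* buffer ids ready for eviction */
static int freelist_len = 0;
static int sweep_hand = 0;          /* clock-sweep position */

/*
 * One bgreclaimer cycle: advance the clock hand, putting unpinned,
 * zero-usage buffers on the freelist, until the high water mark is
 * reached (bounded here so the toy loop always terminates).
 */
static void
reclaim_until_high_water(void)
{
    int examined = 0;

    while (freelist_len < HIGH_WATER_MARK && examined < 2 * NBUFFERS)
    {
        ToyBuf *buf = &pool[sweep_hand];

        if (buf->refcount == 0 && buf->usage_count == 0)
            freelist[freelist_len++] = sweep_hand;  /* eviction candidate */
        else if (buf->usage_count > 0)
            buf->usage_count--;                     /* clock-sweep decrement */

        sweep_hand = (sweep_hand + 1) % NBUFFERS;
        examined++;
    }
}
```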

I wonder if we should recheck the number of freelist items before
sleeping. As the latch currently is reset before sleeping (IIRC), we
might miss a wakeup that arrives in between. It may well be that
bgreclaim needs to run for more than one cycle in a row to keep up...
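
What I have in mind is the usual latch protocol -- reset the latch,
then recheck the shared state before actually sleeping, so a wakeup
that arrives in between isn't lost. A minimal standalone model (toy
names, not the real ResetLatch()/WaitLatch()):

```c
#include <stdbool.h>

/*
 * Toy model of the latch protocol: "latch" stands in for the process
 * latch, "freelist_len" for the shared state bgreclaim watches.
 */
static bool latch = false;
static int freelist_len = 0;
static int low_water_mark = 2;

static void SetLatchToy(void)   { latch = true; }
static void ResetLatchToy(void) { latch = false; }

/*
 * Returns true iff it is safe to sleep: the shared state is rechecked
 * *after* the latch is reset, closing the race where a wakeup arrives
 * between the reset and the sleep. Returning false means "do another
 * reclaim cycle instead of sleeping".
 */
static bool
safe_to_sleep(void)
{
    ResetLatchToy();
    if (freelist_len < low_water_mark)
        return false;   /* work appeared in the meantime; don't sleep */
    return true;
}
```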

> > This section should have a description of how the reclaimer interacts
> > with the bgwriter logic. Do we put dirty buffers on the freelist that
> > are then cleaned by the bgwriter? Which buffers does the bgwriter write
> > out?
>
> The bgwriter is cleaning ahead of the strategy point, and the
> bgreclaimer is advancing the strategy point.

That sentence, in some form, should be in the above paragraph.

> I think we should
> consider having the bgreclaimer wake the bgwriter if it comes across a
> dirty buffer, because while the bgwriter only estimates the rate of
> buffer allocation, bgreclaimer *knows* the rate of allocation, because
> its own activity is tied to the allocation rate. I think there's the
> potential for this kind of thing to make the background writer
> significantly more effective than it is today, but I'm heavily in
> favor of leaving it for a separate patch.

Yes, doing that sounds like a good plan. I'm happy with that being done
in a separate patch.

> > I wonder if we don't want to increase the high watermark when
> > tmp_recent_backend_clocksweep > 0?
>
> That doesn't really work unless there's some countervailing force to
> eventually reduce it again; otherwise, it'd just converge to infinity.
> And it doesn't really seem necessary at the moment.

Right, it obviously needs to go both ways. I'm a bit sceptical about
untunable, fixed numbers proving to be accurate for widely varied
workloads.

> > Hm. Perhaps we should do a bufHdr->refcount != zero check without
> > locking here? The atomic op will transfer the cacheline exclusively to
> > the reclaimer's CPU, even though it will very shortly afterwards be
> > touched by the pinning backend.
>
> Meh. I'm not in favor of adding more funny games with locking unless
> we can prove they're necessary for performance.

Well, this in theory increases the number of processes regularly
touching buffer headers. Currently, if you have one read-IO-intensive
backend, there's pretty much only one process touching those
cachelines. This will make it two. I don't think it's unreasonable to
try to reduce the cacheline pingpong caused by that...
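
To illustrate, a toy version of the pre-check I mean: an unlocked
atomic read lets us skip pinned buffers without ever taking the header
lock, and we recheck under the lock only when the hint says the buffer
might be free (C11 atomics standing in for the real buffer-header
spinlock; all names invented):

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct
{
    atomic_uint refcount;   /* pin count */
    atomic_flag locked;     /* stand-in for the buffer-header spinlock */
} ToyBufHdr;

static ToyBufHdr demo_buf;  /* zero-initialised: unpinned, unlocked */

/*
 * Cheap pre-check: a relaxed load keeps the cacheline in shared state,
 * whereas acquiring the lock below forces it exclusive on this CPU.
 */
static bool
maybe_reclaimable(ToyBufHdr *buf)
{
    return atomic_load_explicit(&buf->refcount, memory_order_relaxed) == 0;
}

static bool
try_reclaim(ToyBufHdr *buf)
{
    bool ok;

    if (!maybe_reclaimable(buf))
        return false;           /* skip pinned buffers without locking */

    while (atomic_flag_test_and_set(&buf->locked))
        ;                       /* spin: toy spinlock */
    /* Recheck under the lock; the unlocked read was only a hint. */
    ok = (atomic_load(&buf->refcount) == 0);
    atomic_flag_clear(&buf->locked);
    return ok;
}
```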

> > * Are we sure that the freelist_lck spinlock won't cause pain? Right now
> > there will possibly be dozens of processes busily spinning on it... I
> > think it's an acceptable risk, but we should think about it.
>
> As you and I have talked about before, we could reduce contention here
> by partitioning the freelist, or by using a CAS loop to pop items off
> of it. But I am not convinced either is necessary; I think it's hard
> for the system to accumulate enough people hitting the freelist
> simultaneously to matter, because the stuff they've got to do between
> one freelist access and the next is generally going to be something
> much more expensive, like reading 8kB from the OS.
>
> One question in my mind is whether we ought to separate this patch
> into two - one for the changes to the locking regime, and another for
> the addition of the bgreclaimer process. Those things are really two
> different features, although they are tightly enough coupled that
> maybe it's OK to keep them together.

I think it's ok to commit them together. Hm, although: It'd actually be
highly interesting to see the effect of replacing the freelist lock with
a spinlock without the rest of these changes. I think that's really a
number we want to see at least once.
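
For reference, the CAS-pop alternative Robert mentions would look
roughly like a Treiber stack -- push and pop via compare-and-swap
instead of holding freelist_lck. A simplified standalone sketch (all
names invented; it also ignores the ABA problem a real lock-free
version would have to address, e.g. with a version counter):

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct FreeNode
{
    struct FreeNode *next;
    int     buf_id;             /* buffer this node represents */
} FreeNode;

static _Atomic(FreeNode *) freelist_head = NULL;

/* Push a node with a CAS loop; retries if another process intervened. */
static void
freelist_push(FreeNode *node)
{
    FreeNode *head = atomic_load(&freelist_head);

    do
        node->next = head;
    while (!atomic_compare_exchange_weak(&freelist_head, &head, node));
}

/* Pop the head node, or return NULL if the freelist is empty. */
static FreeNode *
freelist_pop(void)
{
    FreeNode *head = atomic_load(&freelist_head);

    while (head != NULL &&
           !atomic_compare_exchange_weak(&freelist_head, &head, head->next))
        ;   /* a failed CAS reloads head; retry */
    return head;
}
```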

> I also think it would be good to
> get some statistics on how often regular backends are running the
> clocksweep vs. how often bgreclaimer is satisfying their needs.

I think that's necessary. The patch added buf_backend_clocksweep. Maybe
we just also need buf_backend_from_freelist?

Greetings,

Andres Freund
