Re: Scaling shared buffer eviction

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Scaling shared buffer eviction
Date: 2014-09-11 13:02:34
Message-ID: CA+TgmoaJiJb-ZDBMzVxUZPRO+w9UbhCfKHZaVHAWFGrnrowKEA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks for reviewing, Andres.

On Thu, Sep 11, 2014 at 7:01 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> +static void bgreclaim_quickdie(SIGNAL_ARGS);
>> +static void BgreclaimSigHupHandler(SIGNAL_ARGS);
>> +static void ReqShutdownHandler(SIGNAL_ARGS);
>> +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);
>
> This looks inconsistent.

It's exactly the same as what bgwriter.c does.

> No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a
> good idea, regardless of it possibly being true today (which I'm not
> sure about yet).

We really need a more centralized way to handle error cleanup in
auxiliary processes. The current state of affairs is really pretty
helter-skelter. But for this patch, I think we should aim to mimic
the existing style, as ugly as it is. I'm not sure whether Amit's got
the logic correct, though: I'd agree LWLockReleaseAll(), at a minimum,
is probably a good idea.

>> +Two water mark indicators are used to maintain sufficient number of buffers
>> +on freelist. Low water mark indicator is used by backends to wake bgreclaimer
>> +when the number of buffers in freelist falls below it. High water mark
>> +indicator is used by bgreclaimer to move buffers to freelist.
>
> For me the description of the high water as stated here doesn't seem to
> explain anything.

Yeah, let me try to revise and expand on that a bit:

Background Reclaimer's Processing
---------------------------------

The background reclaimer runs the clock sweep to identify buffers that
are good candidates for eviction and puts them on the freelist. This
makes buffer allocation much faster, since removing a buffer from the
head of a linked list is much cheaper than linearly scanning the whole
buffer pool until a promising candidate is found. It's possible that
a buffer we add to the freelist may be accessed or even pinned before
it's evicted; if that happens, the backend that would have evicted it
will simply disregard it and take the next buffer instead (or run the
clock sweep itself, if necessary). However, to make sure that doesn't
happen too often, we need to keep the freelist as short as possible,
so that there won't be many other buffer accesses between when the
time a buffer is added to the freelist and the time when it's actually
evicted.

We use two water marks to control the activity of the bgreclaimer
process. Each time bgreclaimer is awoken, it will move buffers to the
freelist until the length of the free list reaches the high water
mark. It will then sleep. When the number of buffers on the freelist
reaches the low water mark, backends attempting to allocate new
buffers will set the bgreclaimer's latch, waking it up again. While
it's important for the high water mark to be small (for the reasons
described above), we also need to ensure adequate separation between
the low and high water marks, so that the bgreclaimer isn't constantly
being awoken to find just a handful of additional candidate buffers,
and we need to ensure that the low watermark is adequate to keep the
freelist from becoming completely empty before bgreclaimer has time to
wake up and beginning filling it again.

> This section should have a description of how the reclaimer interacts
> with the bgwriter logic. Do we put dirty buffers on the freelist that
> are then cleaned by the bgwriter? Which buffers does the bgwriter write
> out?

The bgwriter is cleaning ahead of the strategy point, and the
bgreclaimer is advancing the strategy point. I think we should
consider having the bgreclaimer wake the bgwriter if it comes across a
dirty buffer, because while the bgwriter only estimates the rate of
buffer allocation, bgreclaimer *knows* the rate of allocation, because
its own activity is tied to the allocation rate. I think there's the
potential for this kind of thing to make the background writer
significantly more effective than it is today, but I'm heavily in
favor of leaving it for a separate patch.

> I wonder if we don't want to increase the high watermark when
> tmp_recent_backend_clocksweep > 0?

That doesn't really work unless there's some countervailing force to
eventually reduce it again; otherwise, it'd just converge to infinity.
And it doesn't really seem necessary at the moment.

> Hm. Perhaps we should do a bufHdr->refcount != zero check without
> locking here? The atomic op will transfer the cacheline exclusively to
> the reclaimer's CPU. Even though it very shortly afterwards will be
> touched afterwards by the pinning backend.

Meh. I'm not in favor of adding more funny games with locking unless
we can prove they're necessary for performance.

> * Are we sure that the freelist_lck spinlock won't cause pain? Right now
> there will possibly be dozens of processes busily spinning on it... I
> think it's a acceptable risk, but we should think about it.

As you and I have talked about before, we could reduce contention here
by partitioning the freelist, or by using a CAS loop to pop items off
of it. But I am not convinced either is necessary; I think it's hard
for the system to accumulate enough people hitting the freelist
simultaneously to matter, because the stuff they've got to do between
one freelist access and the next is generally going to be something
much more expensive, like reading 8kB from the OS.

One question in my mind is whether we ought to separate this patch
into two - one for the changes to the locking regime, and another for
the addition of the bgreclaimer process. Those things are really two
different features, although they are tightly enough coupled that
maybe it's OK to keep them together. I also think it would be good to
get some statistics on how often regular backends are running the
clocksweep vs. how often bgreclaimer is satisfying their needs.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2014-09-11 13:22:26 Re: Scaling shared buffer eviction
Previous Message ktm@rice.edu 2014-09-11 13:01:01 Re: [REVIEW] Re: Compression of full-page-writes