Re: Page replacement algorithm in buffer cache

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Ants Aasma <ants(at)cybertec(dot)at>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page replacement algorithm in buffer cache
Date: 2013-04-01 16:55:07
Message-ID: CAHyXU0yd0J6ui37KbBH1Nu1CY=UsLrWJRMk_i2sSwok3ERC_aw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Apr 1, 2013 at 10:03 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mon, Apr 1, 2013 at 9:28 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>>> That is one of multiple issues. Contention on the BufFreelistLock is
>>> another one. I agree that usage_count maintenance is unlikely to become a
>>> bottleneck unless one or both of those is fixed first (and maybe not even
>>> then)
>>
>> usage_count manipulation is not a bottleneck but that is irrelevant.
>> It can be affected by other page contention which can lead to priority
>> inversion. I don't believe there is any reasonable argument that
>> sitting and spinning while holding the BufFreelistLock is a good idea.
>
> Hmm. I'm not sure who, if anyone, I'm agreeing or disagreeing with, but
> I think a big part of the reason why BufFreelistLock contention is
> such a big problem is that we do other operations that involve atomics
> while we're holding that lock. You can have contention on a lock even
> if you just take it, do some stuff, and release it. But the longer
> you hold the lock for, the less concurrency you need to have in order
> to get a contention problem. And atomics take a LOT longer to execute
> than regular instructions - so it seems to me that usage_count
> manipulation is relevant not so much because we get contention on the
> buffer spinlocks as because it means we're sitting there serially
> taking and releasing locks while sitting on BufFreelistLock.
>
> In fact, BufFreelistLock is really misnamed, because for the most
> part, the "free list" as we implement it is going to be empty. What the
> BufFreelistLock is really doing is serializing the process of scanning
> for a free buffer. I think THAT is the problem. If we could arrange
> things so as to hold BufFreelistLock only for the amount of time
> needed to remove a buffer from a freelist ... we'd probably buy
> ourselves quite a bit.

right. I'm imagining a buffer scan loop that looks something like
this (uncompiled, untested). "TryLockBufHdr" does a simple TAS
without spinning, returning true if it acquired the lock (a sketch
follows the loop). usage_count is specifically and deliberately
adjusted without holding the buffer header lock (this would require
some careful testing and possibly changes elsewhere):

trycounter = NBuffers;      /* as in StrategyGetBuffer today */

for (;;)
{
    buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];

    if (++StrategyControl->nextVictimBuffer >= NBuffers)
    {
        StrategyControl->nextVictimBuffer = 0;
        StrategyControl->completePasses++;
    }

    /*
     * If the buffer is pinned or has a nonzero usage_count, we cannot use
     * it; decrement the usage_count (unless pinned) and keep scanning.
     */
    if (buf->refcount == 0)
    {
        if (buf->usage_count > 0)
        {
            buf->usage_count--;
            trycounter = NBuffers;
        }
        else
        {
            if (TryLockBufHdr(buf))
            {
                /* Found a usable buffer */
                if (strategy != NULL)
                    AddBufferToRing(strategy, buf);
                return buf;
            }
        }
    }

    if (--trycounter == 0)
    {
        /*
         * We've scanned all the buffers without making any state changes,
         * so all the buffers are pinned (or were when we looked at them).
         * We could hope that someone will free one eventually, but it's
         * probably better to fail than to risk getting stuck in an
         * infinite loop.
         */
        elog(ERROR, "no unpinned buffers available");
    }
}
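
As a sketch of what I mean by TryLockBufHdr (assuming today's
BufferDesc layout with its slock_t buf_hdr_lock field; TAS() from
s_lock.h returns zero on success, so we just invert it):

static inline bool
TryLockBufHdr(volatile BufferDesc *buf)
{
    /* one test-and-set attempt on the header spinlock, no spin loop */
    return TAS(&buf->buf_hdr_lock) == 0;
}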

The advantages are:
*) no spinning under the free list lock (ever)
*) consequently pretty much zero chance of getting scheduled out while
holding it
*) far fewer (in some cases drastically fewer) atomic operations,
unless the sweep generally returns a buffer immediately on every
request, in which case it's about the same
*) fewer CPU cycles overall, unless you somehow miss a whole bunch of
buffers that appear locked when in fact they are not (which seems
improbable)

I think you could implement something approximating the above in
conjunction with your buffer nailing, or without it (although your
stuff would likely reduce the number of cases where you'd really need
it). Ditto Jeff J's idea of completely replacing BufFreelistLock with
a spinlock implementation.
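
For illustration only, Jeff's variant might look roughly like this
(uncompiled sketch; "victim_lock" is a hypothetical slock_t added to
StrategyControl, and SpinLockAcquire/SpinLockRelease are the existing
spin.h primitives):

static int
ClockSweepTick(void)
{
    int victim;

    /* protect only the clock-hand advance with a plain spinlock */
    SpinLockAcquire(&StrategyControl->victim_lock);
    victim = StrategyControl->nextVictimBuffer;
    if (++StrategyControl->nextVictimBuffer >= NBuffers)
    {
        StrategyControl->nextVictimBuffer = 0;
        StrategyControl->completePasses++;
    }
    SpinLockRelease(&StrategyControl->victim_lock);

    return victim;
}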

merlin
