Re: Move unused buffers to freelist

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Move unused buffers to freelist
Date: 2013-06-27 13:01:46
Message-ID: 20130627130146.GH1254@alap2.anarazel.de
Lists: pgsql-hackers

On 2013-06-27 08:23:31 -0400, Robert Haas wrote:
> I'd like to just back up a minute here and talk about the broader
> picture.

Sounds like a very good plan.

> So in other words,
> there's no huge *performance* problem for a working set larger than
> shared_buffers, but there is a huge *scalability* problem. Now why is
> that?

> As far as I can tell, the answer is that we've got a scalability
> problem around BufFreelistLock.

Part of the problem is its name ;)

> Contention on the buffer mapping
> locks may also be a problem, but all of my previous benchmarking (with
> LWLOCK_STATS) suggests that BufFreelistLock is, by far, the elephant
> in the room.

Contention-wise I agree. What I have seen is a huge amount of cacheline
bouncing around the buffer header spinlocks.

> My interest in having the background writer add buffers
> to the free list is basically around solving that problem. It's a
> pretty dramatic problem, as the graph above shows, and this patch
> doesn't solve it.

> One thing that occurred to me while writing this note is that the
> background writer doesn't have any compelling reason to run on a
> read-only workload. It will still run at a certain minimum rate, so
> that it cycles the buffer pool every 2 minutes, if I remember
> correctly.

I have previously added some ad hoc instrumentation that printed the
number of buffers that were required (by other backends) during a
bgwriter cycle and the number of buffers that the bgwriter could
actually write out. I don't think I found any workload where the
bgwriter wrote out a relevant percentage of the necessary pages.
That would explain why the patch doesn't show a big benefit: the
freelist is empty most of the time, so we don't benefit from the
reduced work done under the lock.

I think the whole algorithm that guides how much work the background
writer actually does, including its pacing/sleeping logic, needs to be
rewritten from scratch before we can actually measure the benefit of
this patch. I personally don't think there's much to salvage from the
current code. (A rough sketch of the direction I have in mind follows
the list below.)

Problems with the current code:

* It doesn't manipulate the usage_count and never does anything to used
pages, which means it will just about never find a victim buffer in a
busy database.
* It is not nearly aggressive enough; it only touches a few buffers
ahead of the clock sweep.
* It does not advance the clock sweep, so the individual backends will
touch the same buffers again and bounce the buffer spinlock cachelines
around.
* Its adaptation logic is so slow that it takes several minutes to
adapt to a changed workload.
* ...
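
To make that concrete, here's a rough standalone sketch of the kind of
loop I have in mind - all the names here (SketchBuffer,
bgwriter_fill_freelist(), push_freelist(), ...) are invented for
illustration, this is not actual PostgreSQL code. The bgwriter advances
the shared clock hand itself, decrements usage counts, writes out dirty
victims, and pushes clean ones onto the freelist:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* power of two, so unsigned wraparound keeps the modulo consistent */
    #define NBUFFERS 1024

    typedef struct SketchBuffer
    {
        struct SketchBuffer *free_next;  /* freelist link */
        int  refcount;
        int  usage_count;
        bool dirty;
    } SketchBuffer;

    static SketchBuffer  buffers[NBUFFERS];
    static atomic_uint   clock_hand;     /* shared with the backends' sweeps */
    static SketchBuffer *freelist_head;  /* assume a small lock protects this */

    static void
    write_out(SketchBuffer *buf)
    {
        buf->dirty = false;              /* actual I/O elided */
    }

    static void
    push_freelist(SketchBuffer *buf)
    {
        buf->free_next = freelist_head;  /* locking elided for brevity */
        freelist_head = buf;
    }

    /* One bgwriter round: sweep until 'goal' clean buffers are on the
     * freelist.  A real version would recheck each buffer when a backend
     * pops it off the freelist, since it may have been used again. */
    static void
    bgwriter_fill_freelist(int goal)
    {
        int found = 0;

        while (found < goal)
        {
            unsigned      pos = atomic_fetch_add(&clock_hand, 1) % NBUFFERS;
            SketchBuffer *buf = &buffers[pos];

            /* the per-buffer header lock would be taken here */
            if (buf->refcount != 0)
                continue;                /* pinned, skip */
            if (buf->usage_count > 0)
            {
                buf->usage_count--;      /* the current bgwriter never does this */
                continue;
            }
            if (buf->dirty)
                write_out(buf);          /* so backends never see dirty victims */
            push_freelist(buf);
            found++;
        }
    }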

There's another thing we could do to noticeably improve the scalability
of buffer acquisition: currently we do a huge amount of work under the
freelist lock.
In StrategyGetBuffer:

    LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
    ...
    /* check the freelist, which will usually be empty */
    ...
    for (;;)
    {
        buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];

        ++StrategyControl->nextVictimBuffer;

        LockBufHdr(buf);
        if (buf->refcount == 0)
        {
            if (buf->usage_count > 0)
            {
                buf->usage_count--;
            }
            else
            {
                /* Found a usable buffer */
                if (strategy != NULL)
                    AddBufferToRing(strategy, buf);
                return buf;
            }
        }
        UnlockBufHdr(buf);
    }

So we perform the entire clock sweep, until we find a single buffer we
can use, inside a *global* lock. At times we need to iterate over the
whole of shared_buffers BM_MAX_USAGE_COUNT (5) times until we have
pushed the usage counts down far enough (if the database is busy it can
take even longer...).
In a busy database, where usually all the usage counts are high, the
next backend will touch a lot of those buffers again, which causes
massive cacheline eviction & bouncing.
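
For a rough, illustrative (not measured) sense of scale: with
shared_buffers = 8GB and 8kB pages there are 8GB / 8kB = 1,048,576
buffer headers, so a worst-case sweep can mean around 5 * 1,048,576,
i.e. more than five million, LockBufHdr()/UnlockBufHdr() cycles, all
while holding BufFreelistLock exclusively.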

It seems far more sensible to protect only the clock sweep's
nextVictimBuffer with a spinlock. With some care, all the rest can
happen without any global interlock.
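
Just to illustrate the lock scope, a minimal standalone sketch - again
with invented names, using a C11 atomic increment where the real thing
could just as well use a spinlock around nextVictimBuffer, and an
atomic_flag standing in for the buffer header spinlock:

    #include <stdatomic.h>

    #define NBUFFERS 1024           /* power of two keeps wraparound aligned */

    typedef struct
    {
        atomic_flag hdr_lock;       /* stand-in for the buffer header spinlock */
        int         refcount;
        int         usage_count;
    } VictimBuffer;

    /* static storage: flags start clear, counters start at zero */
    static VictimBuffer buffers[NBUFFERS];
    static atomic_uint  next_victim; /* the *only* globally shared sweep state */

    static VictimBuffer *
    get_victim_buffer(void)
    {
        for (;;)
        {
            /* advancing the clock hand is one atomic op, not a global lwlock */
            unsigned      pos = atomic_fetch_add(&next_victim, 1) % NBUFFERS;
            VictimBuffer *buf = &buffers[pos];

            /* from here on, only this buffer's header lock is held */
            while (atomic_flag_test_and_set(&buf->hdr_lock))
                ;                   /* spin */

            if (buf->refcount == 0)
            {
                if (buf->usage_count > 0)
                    buf->usage_count--;
                else
                    return buf;     /* caller releases hdr_lock, as today */
            }
            atomic_flag_clear(&buf->hdr_lock);
        }
    }

Backends hunting for different victim buffers then contend on nothing
but the cacheline containing next_victim.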

I think even after fixing this - which we definitely should do - having
a sensible/more aggressive bgwriter move pages onto the freelist makes
sense, because backends then don't need to deal with dirty pages.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
