Re: our buffer replacement strategy is kind of lame

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <stark(at)mit(dot)edu>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: our buffer replacement strategy is kind of lame
Date: 2012-01-03 15:18:22
Message-ID: CA+TgmoYtL0-uB+sOf+P-DJ_qhRFgRn=RdYRnv_8y8Y4r4vqfqg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jan 2, 2012 at 2:53 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> Get rid of the freelist?  Once shared buffers are full, it's just about
>> useless anyway.  But you'd need to think about the test cases that you
>> pay attention to, as there might be scenarios where it remains useful.
>
> Agree freelist is mostly useless, but startup and dropping objects requires it.

Not really. If someone drops an object, we must invalidate all the
buffers immediately, but adding them to the free list is just an
optimization to make sure we reuse the invalidated buffers in
preference to evicting buffers that still have valid contents. But I
suspect that in practice this is not a very important optimization,
because (1) most people probably don't drop permanent tables or
databases very often and (2) buffers that are being heavily used
should have a positive reference count, in which case the clock sweep
will pass over them and land on one of the newly-invalidated buffers
anyway.

> When its full its just an integer > 0 test, which is cheap and its on
> the same cacheline as the nextVictimBuffer anyway, so we have to fetch
> it.
>
> The clock sweep is where all the time goes, in its current form.

...but I agree with this. In its current form, the clock sweep has to
acquire a spinlock for every buffer it touches. That's really
expensive, and I think we need to either get rid of it altogether or
at least find some way to move it into the background. Ideally, in
the common case, a backend that wants to allocate a buffer shouldn't
need to do more than acquire a spinlock, pop a buffer off some sort of
linked list of allocation targets, and release the spinlock. Whatever
other computation is needed should get pushed out of foreground
processing.

> We have 2 problems
>
> 1. StrategySyncStart() requests exclusive lock on BufFreelistLock, so
> bgwriter has to fight with backends to find out where it should clean.
> As a result it spends lots of time waiting and insufficient time
> cleaning.

I have trouble accepting that this is really the problem we should be
solving. There's only one background writer process, and that process
is waking up every 200ms by default. Many people probably tune that
down quite a bit, but unless I'm quite mistaken, it should still be a
drop in the bucket next to what the backends are doing.

> 2. When a backend can't find a free buffer, it spins for a long time
> while holding the lock. This makes the buffer strategy O(N) in its
> worst case, which slows everything down. Notably, while this is
> happening the bgwriter sits doing nothing at all, right at the moment
> when it is most needed. The Clock algorithm is an approximation of an
> LRU, so is already suboptimal as a "perfect cache". Tweaking to avoid
> worst case behaviour makes sense. How much to tweak? Well,...

I generally agree with this analysis, but I don't think the proposed
patch is going to solve the problem. It may have some merit as a way
of limiting the worst case behavior. For example, if every shared
buffer has a reference count of 5, the first buffer allocation that
misses is going to have a lot of work to do before it can actually
come up with a victim. But I don't think it's going to provide good
scaling in general. Even if the background writer only spins through,
on average, ten or fifteen buffers before finding one to evict, that
still means we're acquiring ten or fifteen spinlocks while holding
BufFreelistLock. I don't currently have the measurements to prove
that's too expensive, but I bet it is.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2012-01-03 15:24:52 Re: ALTER TABLE lock strength reduction patch is unsafe
Previous Message Pavel Stehule 2012-01-03 14:33:55 Re: Add SPI results constants available for PL/*