Re: our buffer replacement strategy is kind of lame

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <stark(at)mit(dot)edu>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: our buffer replacement strategy is kind of lame
Date: 2012-01-03 17:15:47
Message-ID: CA+TgmobTUhTC3EJa-QLY9i=V5PCa4MM8JX2FE32DZC3rMhrsKg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jan 3, 2012 at 10:56 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Tue, Jan 3, 2012 at 3:18 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> The clock sweep is where all the time goes, in its current form.
>>
>> ...but I agree with this.  In its current form, the clock sweep has to
>> acquire a spinlock for every buffer it touches.  That's really
>> expensive, and I think we need to either get rid of it altogether or
>> at least find some way to move it into the background.  Ideally, in
>> the common case, a backend that wants to allocate a buffer shouldn't
>> need to do more than acquire a spinlock, pop a buffer off some sort of
>> linked list of allocation targets, and release the spinlock.  Whatever
>> other computation is needed should get pushed out of foreground
>> processing.
>
> So you don't think a freelist is worth having, but you want a list of
> allocation targets.
> What is the practical difference?

I think that our current freelist is practically useless, because it
is almost always empty, and the cases where it's not empty (startup,
and after a table or database drop) are so narrow that we don't really
get any benefit out of having it. However, I'm not opposed to the
idea of a freelist in general: I think that if we actually put in some
effort to keep the freelist in a non-empty state it would help a lot,
because backends would then have much less work to do at buffer
allocation time.

>> I have trouble accepting that this is really the problem we should be
>> solving.  There's only one background writer process, and that process
>> is waking up every 200ms by default.  Many people probably tune that
>> down quite a bit, but unless I'm quite mistaken, it should still be a
>> drop in the bucket next to what the backends are doing.
>
> That doesn't follow. Forcing the bgwriter to wait makes no sense. Yes,
> most of the contention is caused by the other backends, but the
> bgwriter experiences that contention currently when it has no need to
> do so.

Sure, but if that contention is a negligible fraction of the total and
doesn't cost anything measurable, then it doesn't make sense to add
code to eliminate it. There are all sorts of places in the system
where we have contention at a level that doesn't affect performance.
Many of those could probably be fixed by adding more complicated
locking, but that would just complicate the code to no purpose. If
there's a demonstrable performance benefit to this change then it's
worth a harder look, but my gut feeling is that there won't be.

>>> 2. When a backend can't find a free buffer, it spins for a long time
>>> while holding the lock. This makes the buffer strategy O(N) in its
>>> worst case, which slows everything down. Notably, while this is
>>> happening the bgwriter sits doing nothing at all, right at the moment
>>> when it is most needed. The Clock algorithm is an approximation of an
>>> LRU, so is already suboptimal as a "perfect cache". Tweaking to avoid
>>> worst case behaviour makes sense. How much to tweak? Well,...
>>
>> I generally agree with this analysis, but I don't think the proposed
>> patch is going to solve the problem.  It may have some merit as a way
>> of limiting the worst case behavior.  For example, if every shared
>> buffer has a reference count of 5, the first buffer allocation that
>> misses is going to have a lot of work to do before it can actually
>> come up with a victim.  But I don't think it's going to provide good
>> scaling in general.  Even if the background writer only spins through,
>> on average, ten or fifteen buffers before finding one to evict, that
>> still means we're acquiring ten or fifteen spinlocks while holding
>> BufFreelistLock. I don't currently have the measurements to prove
>> that's too expensive, but I bet it is.
>
> I think it's worth reducing the cost of scanning, but that has little
> to do with solving the O(N) problem. I think we need both.

Maybe. I would really like a unified solution to both problems, but
I'd be willing to accept a solution for one of them if we're confident
that we've actually solved it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
