Re: Page replacement algorithm in buffer cache

From: Ants Aasma <ants(at)cybertec(dot)at>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page replacement algorithm in buffer cache
Date: 2013-03-23 00:27:38
Message-ID: CA+CSw_sMmHt-Up+SiyL4OPkrgfECQVnMeOpyJz+=M7x=DNTzWg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> well if you do a non-locking test first you could at least avoid some
> cases (and, if you get the answer wrong, so what?) by jumping to the
> next buffer immediately. if the non locking test comes good, only
> then do you do a hardware TAS.
>
> you could in fact go further and dispense with all locking in front of
> usage_count, on the premise that it's only advisory and not a real
> refcount. so you only then lock if/when it's time to select a
> candidate buffer, and only then when you did a non locking test first.
> this would of course require some amusing adjustments to various
> logical checks (usage_count <= 0, heh).

Moreover, if the buffer happens to miss a decrement due to a data
race, there's a good chance that the buffer is heavily used and
wouldn't need to be evicted soon anyway. (if you arrange it to be a
read-test-inc/dec-store operation then you will never go out of
bounds) However, clocksweep and usage_count maintenance is not what is
causing contention because that workload is distributed. The issue is
pinning and unpinning. There we need an accurate count and there are
some pages like index roots that get hit very heavily. Things to do
there would be in my opinion convert to a futex based spinlock so when
there is contention it doesn't completely kill performance and then
try to get rid of the contention. Converting to lock-free pinning
won't help much here as what is killing us here is the cacheline
bouncing.

One way to get rid of contention is the buffer nailing idea that
Robert came up with. If some buffer gets so hot that maintaining
refcount on the buffer header leads to contention, promote that buffer
to a nailed status, let everyone keep their pin counts locally and
sometime later revisit the nailing decision and if necessary convert
pins back to the buffer header.

One other interesting idea I have seen is closeable scalable nonzero
indication (C-SNZI) from scalable rw-locks [1]. The idea there is to
use a tree structure to dynamically stripe access to the shared lock
counter when contention is detected. Downside is that considerable
amount of shared memory is needed so there needs to be some way to
limit the resource usage. This is actually somewhat isomorphic to the
nailing idea.

The issue with the current buffer management algorithm is that it
seems to scale badly with increasing shared_buffers. I think the
improvements should concentrate on finding out what is the problem
there and figuring out how to fix it. A simple idea to test would be
to just partition shared buffers along with the whole clock sweep
machinery into smaller ones, like the buffer mapping hash tables
already are. This should at the very least reduce contention for the
clock sweep even if it doesn't reduce work done per page miss.

[1] http://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ants Aasma 2013-03-23 00:46:20 Re: SDP query optimizer
Previous Message Josh Berkus 2013-03-23 00:22:28 Re: SDP query optimizer