Re: Page replacement algorithm in buffer cache

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Ants Aasma <ants(at)cybertec(dot)at>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page replacement algorithm in buffer cache
Date: 2013-03-31 18:27:06
Message-ID: CAMkU=1zVSyNRR_AQh4j_w6h37+qyvAz4fY8A+QP8e0dsuBg7Fw@mail.gmail.com
Lists: pgsql-hackers

On Friday, March 22, 2013, Ants Aasma wrote:

> On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure <mmoncure(at)gmail(dot)com>
> wrote:
> > well if you do a non-locking test first you could at least avoid some
> > cases (and, if you get the answer wrong, so what?) by jumping to the
> > next buffer immediately. if the non locking test comes good, only
> > then do you do a hardware TAS.
> >
> > you could in fact go further and dispense with all locking in front of
> > usage_count, on the premise that it's only advisory and not a real
> > refcount. so you only then lock if/when it's time to select a
> > candidate buffer, and only then when you did a non locking test first.
> > this would of course require some amusing adjustments to various
> > logical checks (usage_count <= 0, heh).
>
> Moreover, if the buffer happens to miss a decrement due to a data
> race, there's a good chance that the buffer is heavily used and
> wouldn't need to be evicted soon anyway. (if you arrange it to be a
> read-test-inc/dec-store operation then you will never go out of
> bounds) However, clocksweep and usage_count maintenance is not what is
> causing contention because that workload is distributed. The issue is
> pinning and unpinning.
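
[Illustration of the two ideas quoted above: a non-locking test before the
hardware TAS, and a read-test-inc/dec-store usage_count that can lose an
update but never leaves its bounds. This is only a minimal sketch in C11
atomics. SketchBuf, MAX_USAGE_COUNT, and every function name here are
hypothetical and are not PostgreSQL's actual BufferDesc or bufmgr code.]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_USAGE_COUNT 5   /* hypothetical cap on usage_count */

/* Hypothetical, simplified buffer header (assumption, not BufferDesc). */
typedef struct {
    atomic_bool locked;       /* header spinlock, taken with a TAS */
    atomic_int  usage_count;  /* advisory only, updated without the lock */
    atomic_int  refcount;     /* pins; only trusted while holding the lock */
} SketchBuf;

void sketch_buf_init(SketchBuf *buf, int usage)
{
    assert(usage >= 0 && usage <= MAX_USAGE_COUNT);
    atomic_init(&buf->locked, false);
    atomic_init(&buf->usage_count, usage);
    atomic_init(&buf->refcount, 0);
}

/*
 * Read, test, inc, store as separate steps. A concurrent update may be
 * lost, but every stored value was range-checked, so the counter stays
 * within [0, MAX_USAGE_COUNT].
 */
void usage_bump(SketchBuf *buf)
{
    int u = atomic_load_explicit(&buf->usage_count, memory_order_relaxed);
    if (u < MAX_USAGE_COUNT)
        atomic_store_explicit(&buf->usage_count, u + 1, memory_order_relaxed);
}

void usage_decay(SketchBuf *buf)
{
    int u = atomic_load_explicit(&buf->usage_count, memory_order_relaxed);
    if (u > 0)
        atomic_store_explicit(&buf->usage_count, u - 1, memory_order_relaxed);
}

/*
 * One clock-sweep step. Returns true, with the header lock held, if the
 * buffer was claimed as an eviction candidate.
 */
bool try_claim_victim(SketchBuf *buf)
{
    /* Non-locking test first; a stale answer only costs a skipped buffer. */
    if (atomic_load_explicit(&buf->refcount, memory_order_relaxed) != 0)
        return false;
    if (atomic_load_explicit(&buf->usage_count, memory_order_relaxed) > 0) {
        usage_decay(buf);
        return false;
    }

    /* The buffer looks evictable: only now pay for the hardware TAS. */
    if (atomic_exchange_explicit(&buf->locked, true, memory_order_acquire))
        return false;   /* someone else holds the lock, move on */

    /* Recheck under the lock, since the unlocked reads may have been stale. */
    if (atomic_load_explicit(&buf->refcount, memory_order_relaxed) != 0 ||
        atomic_load_explicit(&buf->usage_count, memory_order_relaxed) != 0) {
        atomic_store_explicit(&buf->locked, false, memory_order_release);
        return false;
    }
    return true;
}
```

[The sketch assumes pinning still happens under the header lock, so it
does nothing about the pin/unpin contention described above.]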

That is one of multiple issues. Contention on the BufFreelistLock is
another one. I agree that usage_count maintenance is unlikely to become a
bottleneck unless one or both of those are fixed first (and maybe not even
then).

...

> The issue with the current buffer management algorithm is that it
> seems to scale badly with increasing shared_buffers.

I do not think that this is the case. As far as I have seen, neither of
the SELECT-only contention points (pinning/unpinning of index root blocks
when all data is in shared_buffers, and BufFreelistLock when not all data
is in shared_buffers) is made worse by increasing shared_buffers. They do
scale badly with the number of concurrent processes, though.

The reports of write-heavy workloads not scaling well with shared_buffers
do not seem to be driven by the buffer management algorithm, or at least
not the freelist part of it. They mostly seem to center on the kernel and
the IO controllers.

Cheers,

Jeff
