Re: Clock sweep not caching enough B-Tree leaf pages?

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Clock sweep not caching enough B-Tree leaf pages?
Date: 2015-04-14 22:22:42
Message-ID: CAM3SWZS4b0Pk9gHkDeRX0wo40ODtYHA5p0S7_My0ruvEMPWDJw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Apr 14, 2015 at 2:25 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> So, I was thinking about this a little bit more today, prodded by my
> coworker John Gorman. I'm wondering if we could drive this off of the
> clock sweep; that is, every time the clock sweep touches a buffer, its
> usage count goes down by one, but we also set two flag bits. Every
> time the buffer gets touched thereafter, we check whether any flag
> bits are set; if so, we clear one and increase the usage count, else
> we do nothing. So the usage count can increase at most twice per
> clock sweep. The advantage of that is that, as with Peter's approach,
> it is harder for the usage count of a buffer to max out - to get
> there, you need sustained access over a longer period of time. But
> it's not time-dependent, so replaying the same workload at twice the
> speed or half the speed on faster or slower hardware doesn't change
> the choice of which buffer to evict, which seems good.

Why is that good?
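
(For concreteness, here is roughly how I read the scheme you're
describing -- a sketch only, not tested. BufferDesc and
BM_MAX_USAGE_COUNT are the real things; "sweep_flags" is an invented
field, and buffer header locking is omitted entirely:)

/* Called when the clock sweep hand passes over a buffer. */
static void
clock_sweep_touch(BufferDesc *buf)
{
    if (buf->usage_count > 0)
        buf->usage_count--;     /* usage count goes down by one... */
    buf->sweep_flags = 2;       /* ...and both flag bits get set */
}

/* Called on every ordinary buffer access thereafter. */
static void
buffer_access_touch(BufferDesc *buf)
{
    if (buf->sweep_flags > 0)
    {
        buf->sweep_flags--;     /* clear one flag bit */
        if (buf->usage_count < BM_MAX_USAGE_COUNT)
            buf->usage_count++;
    }
    /* else do nothing: the count rises at most twice per sweep */
}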

My little prototype basically implemented the LRU-K approach to
preventing correlated references (which tpc-b exhibits in a really
obvious way - not sure how significant that is). This would have
accidentally made it weight frequency better, but the prototype did
not pretend to actually implement something like LRU-K (which is not
the state of the art in any case), because it did not consider the
recency of the second-to-last access of the buffer at all.
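
(Roughly, the LRU-K correlated reference defense looks like this;
the last_touch_us field and the period setting are invented names,
and locking is ignored:)

static void
touch_buffer(BufferDesc *buf, int64 now_us)
{
    /*
     * Re-references within the correlated reference period count as
     * part of one logical reference (the LRU-K idea), so they earn
     * no additional usage count credit.
     */
    if (now_us - buf->last_touch_us >= correlated_ref_period_us &&
        buf->usage_count < BM_MAX_USAGE_COUNT)
        buf->usage_count++;

    buf->last_touch_us = now_us;
}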

The prototype's performance started off well, but regressed in later
pgbench-collector iterations (due to the influence of VACUUM, I
think). If I'd wanted to cheat, I could have done only one large
45-minute run, which would have made the prototype look much better
still.

> And it will
> cause us to prefer buffers which are accessed repeatedly over a period
> of time rather than buffers that are accessed a bunch of times in
> quick succession and then not touched again for a while, which seems
> like a good bet.

I think that people were all too quick to dismiss the idea of a wall
time interval playing some role here (at least as a defense against
correlated references, as a correlated reference period). I suppose
that's because it doesn't fit with an intuition which says that kind
of interval ought to be derived algebraically - magic delay settings
are considered suspect. However, the same can be said of the
"five-minute rule", a highly influential rule of thumb that Jim Gray
came up with. The five-minute rule is now obsolete, but it took a
long time to become so, and the fundamental observation still applies
(Wikipedia says the interval depends on what type of disks you have
these days, but the fact remains that rules of thumb like this can be
more or less robust).
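
(To recap the rule's break-even form, with the approximate numbers
from the 1997 revisit of the paper, just to show where "five minutes"
came from:

    break_even_interval
        = (pages_per_MB_of_RAM / accesses_per_second_per_disk)
        * (price_per_disk_drive / price_per_MB_of_RAM)
        ~= (128 / 64) * ($2000 / $15)
        ~= 266 seconds, i.e. about five minutes

The particular numbers go stale, as they have, but the break-even
structure of the argument is what endures.)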

I think that correlated reference type delays *are* used effectively
in real world systems, without being fragile or overly
workload-dependent.

> I can't convince myself that this fully solves the (currently
> existing) problem of usage counts increasing too fast. In theory,
> every time the clock sweep hand advances, it decrements the usage
> count of one buffer by one, but also makes it possible for the usage
> count of that buffer to increase by up to two (or by only one if the
> buffer previously had a usage count of five). So with the right
> access pattern, it seems like you could still get to a place where a
> typical buffer eviction has to do a lot of scanning before it finds a
> victim buffer. As with the present algorithm, we'd be most likely to
> have that problem when the buffer hit ratio is very high, so that
> there are many opportunities for usage counts to go up between each
> opportunity to push usage counts back down. But it seems like it
> would at least limit the damage.

> I haven't tried this or anything, so this is just random brainstorming.

That's what it'll take, I think -- random brainstorming, and crude
prototypes to test theories.

As long as we're doing random brainstorming, I'd suggest looking at
making clocksweep actually approximate LRU-K/LRU-2 (which, again, to
be clear, my prototype did not do). The clocksweep could maintain
statistics about the recency of the second-to-last access across all
buffers, and discriminate against buffers according to which bucket
of the population they fall into. I'm not sure how aggressively we'd
penalize buffers that had very old penultimate references (or credit
those that had very recent penultimate references), or what the
bucket partitioning scheme would be, but that's probably where I'd
take it next. For example, buffers with a penultimate reference more
than a standard deviation below the mean would be doubly penalized
(and maybe the opposite for buffers with penultimate accesses a
standard deviation above the mean). If that didn't work so well, then
I'd look into an ARC-style recency and frequency list (while
remembering things about already-evicted blocks, which LRU-K does not
do... although that paper is from the early 1990s).
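
(A sketch of the kind of discrimination I mean -- field names
invented, and the question of how the population statistics would be
maintained cheaply is hand-waved away:)

static int
effective_usage_count(BufferDesc *buf, double mean_ts, double stddev_ts)
{
    int     count = buf->usage_count;

    /*
     * Penultimate access more than a stddev below the population
     * mean: double penalty.  More than a stddev above it: credit.
     */
    if (buf->penultimate_access_ts < mean_ts - stddev_ts)
        count -= 2;
    else if (buf->penultimate_access_ts > mean_ts + stddev_ts)
        count += 1;

    return Max(count, 0);
}

/* The clock sweep would then evict based on effective_usage_count()
 * rather than the raw usage_count. */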

There are approaches to relieving lock contention with ARC (e.g.,
CAR, or what OpenZFS now does [1]). So maybe we could look into doing
something similar if simpler approaches turn out to be less
effective.

[1] http://blog.delphix.com/prakash/2015/03/23/openzfs-reducing-arc-lock-contention/
--
Peter Geoghegan
