Re: [HACKERS] Clock with Adaptive Replacement

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Clock with Adaptive Replacement
Date: 2018-04-28 23:16:59
Message-ID: CAH2-Wz=WBysPM7nhDqacH0UicK5D65aTcKP0UMOJ6Zq+jA+3rw@mail.gmail.com
Lists: pgsql-hackers

On Sat, Apr 28, 2018 at 7:39 AM, Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> Perhaps I'm misunderstanding the case here, but with the ring buffer we
> use for sequential access, aren't we already discouraging keeping heap
> pages around, particularly when they're coming from a sequential scan?
>
> I'm not suggesting that's a bad thing either, in fact it seems like a
> generally good thing and I don't think we get many complaints about it,
> I'm just trying to point out that we already have a distinction between
> heap-pages-from-seq-scans and index pages (and heap pages from using
> those indexes) and therefore I'm not convinced by an argument that we
> shouldn't ever treat pages from the heap differently than pages from the
> indexes.

The argument is not a slam dunk, to be sure. Since you bring it up, I
happen to think that the ring buffer/strategy scan thing is too
aggressive in a lot of cases. Why not cache more when the existing
contents of shared_buffers are of marginal importance? Concern about a
kind of "tyranny of the average case" seems fairly well founded to me.
I am most concerned about improving the average case, but I'm also
very concerned about not regressing atypical or bursty cases.
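To make the strategy scan behavior concrete for anyone following along,
here's a toy sketch of the general idea: a bulk scan cycles through a
small, fixed ring of buffers instead of competing for the whole pool,
which is why sequentially scanned heap pages rarely stick around. The
names and the 32-buffer ring size are illustrative only, not the actual
freelist.c code.

#include <stdio.h>

#define RING_SIZE 32    /* assumed for illustration; ~256kB of 8kB buffers */

typedef struct ScanRing
{
    int buffers[RING_SIZE];   /* buffer ids this scan has already used */
    int current;              /* next ring slot to reuse */
} ScanRing;

static int next_free = 0;

/* Stand-in for grabbing a victim buffer from the main pool. */
static int
alloc_fresh_buffer(void)
{
    return next_free++;
}

/* Pick a buffer for the next page of a bulk scan, reusing the ring. */
static int
ring_get_buffer(ScanRing *ring)
{
    int buf = ring->buffers[ring->current];

    if (buf == -1)              /* first lap: fault in a new buffer */
        buf = ring->buffers[ring->current] = alloc_fresh_buffer();

    ring->current = (ring->current + 1) % RING_SIZE;
    return buf;
}

int
main(void)
{
    ScanRing ring;

    for (int i = 0; i < RING_SIZE; i++)
        ring.buffers[i] = -1;
    ring.current = 0;

    /* A "scan" of 100 pages touches only RING_SIZE distinct buffers. */
    for (int page = 0; page < 100; page++)
        ring_get_buffer(&ring);
    printf("distinct buffers used: %d\n", next_free);
    return 0;
}

A scan of any length only ever occupies those few buffers, no matter
how big shared_buffers is -- which is exactly the aggressiveness I'm
talking about.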

> This could change when considering an environment where we aren't trying
> to take advantage of the kernel's buffer cache and things like its
> read-ahead, and maybe that's what you're getting at here, but I would
> also point out that we have more inherent understanding of how pages are
> likely to be accessed than the kernel does and we shouldn't be
> explicitly trying to avoid taking advantage of that knowledge.

I'm definitely in favor of taking advantage of that knowledge. I just
think that we can derive the choice of which buffer to evict from
high-level principles that are relatively easy to understand in isolation.
This may matter later, when production system behavior needs to be
analyzed. I might agree to the idea of artificially favoring one type
of block over another if I thought that there was no better way, but I
think that there is. I also think that a generalized approach isn't
that much more difficult, particularly when you factor in the need to
convince other people.

> I agree that we should definitely be thinking about data density here-
> and that's another case where we have more information about the data
> than the kernel does (which I compare to as it tries to predict value
> also but doesn't have as much information).

Even my 10x figure is, in a way, very conservative. B-Tree index pages
have another obvious though potentially overlooked advantage, which is
that they naturally have spatial or logical locality. The only kind of
locality that heap pages can ever have is temporal locality. I
hesitate to say too much more about what must be true on average,
across the universe of Postgres installations, but I can safely say
this much: it's totally pedestrian and normal for *leaf* pages to have
more than 100x the utility of heap pages in some applications, since a
leaf page is not just "denser"; it may also have significant spatial
or logical locality, where heap pages have none at all.

There could very easily be an enormous range of usefulness among
buffers in shared_buffers. And yet, we only recognize the increments
that usage_count can represent (this is before we even think about the
actual problems with how usage_count is maintained). We're never going
to be able to give *radically* more useful leaf pages a concomitant
advantage during buffer eviction if we have to use some rule of thumb.
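To put a number on how little the clocksweep can express, here's a toy
model of the usual saturating usage_count. The cap of 5 matches what I
recall being in buf_internals.h, but treat the details as illustrative
rather than the real buffer manager code:

#include <stdio.h>

#define MAX_USAGE_COUNT 5   /* saturating cap, as in the familiar clocksweep */

typedef struct Buf
{
    int usage_count;
} Buf;

/* A buffer hit bumps the counter, but it saturates almost immediately. */
static void
hit(Buf *b)
{
    if (b->usage_count < MAX_USAGE_COUNT)
        b->usage_count++;
}

int
main(void)
{
    Buf heap_page = {0};
    Buf leaf_page = {0};

    for (int i = 0; i < 10; i++)
        hit(&heap_page);
    for (int i = 0; i < 1000; i++)   /* 100x the hits... */
        hit(&leaf_page);

    /* ...but the clock sweep sees no difference between the two. */
    printf("heap usage_count = %d, leaf usage_count = %d\n",
           heap_page.usage_count, leaf_page.usage_count);
    return 0;
}

A leaf page with 100x the hits of a heap page ends up with exactly the
same usage_count, so the sweep has no basis on which to prefer it.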

> Another consideration here
> to think about would be internal pages rather than just leaf pages-
> those have an even higher likelihood of being needed again quickly as
> they need to be traversed much more than leaf pages.

As I said, I'm not too worried about that case, because there are so
few internal pages -- we're probably talking about a small fraction of
1% of the total. It seems unlikely that there'd be so much cache
pressure that the cost of reading in an internal block is much worse
than the cost of reading in any other block. Main memory sizes are big
enough that we should be able to think about things at the level of
minutes (maybe we can't right now, but we *should* be able to).

(This probably doesn't matter much when it comes to finding a way
forward, so we can agree to disagree here without consequence.)

> This is definitely an interesting and valuable point. Having pages
> bouncing in and out of shared buffers means that we aren't properly
> tracking their value. Just brainstorming, but I wonder if there's a way
> we could track their value even when we've evicted them, without slowing
> down the entire system. Not having any great ideas off-hand for that
> but some kind of "we last accessed this block an hour ago" and "we last
> accessed this block half a millisecond ago" might be interesting to try
> to avoid such ping-ponging. I wonder if there are known strategies for
> addressing this, perhaps by having a data structure which effectively
> tracks hit rates for all potential blocks...

Well, ARC has its ghost lists (so does CAR, as Thomas pointed out just
now). ARC/CAR is adaptive, in that it drifts towards favoring either
recency or frequency for the system as a whole. Provided we can make
that scalable, it will probably work well for us. The bi-modal nature
of pages in Postgres seems to make ARC or CAR a good fit. Also, I
suspect that we can afford to use a somewhat less scalable approach,
since it doesn't matter how fast buffer eviction is if it's just
spinning its wheels, and because in recent years we've mitigated some
of the bottlenecks.

In a way, it's not at all surprising that we can end up with
usage_counts of 5 all around when shared_buffers is large. Blocks end
up "resting on their laurels": they become impossible to evict owing to
little more than blind luck, such as having been involved in a burst of
activity some time ago (so much for clocksweep approximating an LRU;
this is actually more like an LFU). ARC/CAR moves a "recent" block to
the "frequent" list if the block gets referenced a second time after
an interval that is, in effect, self-correcting (it may or may not be
from the ghost recent list or the real buffer cache recent list --
either way it goes to the head of the frequent list). Moving a block
from the recent to the frequent list also enlarges the recent list
target size at the expense of the frequent list target size. "This
block in particular deserves to be a frequent block, but as a
consequence all blocks should collectively be considered slightly less
deserving of being frequent blocks."

Importantly, this allows blocks to have their extra utility recognized
by getting moved to the frequent list, while specifically tracking
whether the system fails to appreciate recent blocks that deserve to be
promoted to frequent blocks but cannot get a break, because they get
evicted from the recent list (including the ghost recent list) too
quickly. Because the recent ghost list and frequent ghost list are fixed
in size (only the real lists/head lists get resized), we won't forget
too much about recency or frequency. And, we won't go overboard with
"writing off frequent list blocks as has-been dead weight", because we
notice that we've gone too far, and measure that in terms of would-be
cache hits that ended up as actual cache misses. Most important of
all, we won't "get stuck", as we see with clocksweep today, while
still having something that's stable for stable workloads.
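To make the adaptation I'm describing concrete, here's a heavily
simplified sketch that tracks only the list sizes and the recency
target, in the spirit of ARC's adjustment rule. Real ARC/CAR (and
anything we'd actually build) manages the lists themselves; the names
and constants here are made up for illustration:

#include <stdio.h>

/*
 * Minimal sketch of ARC-style adaptation. "recent" ~ T1, "frequent" ~ T2,
 * and the ghost lists remember pages recently evicted from each.
 */
typedef struct ArcState
{
    int capacity;        /* c: buffers that actually hold data */
    int target_recent;   /* p: how much of the cache the recent list "deserves" */
    int recent;          /* |T1| */
    int frequent;        /* |T2| */
    int ghost_recent;    /* |B1|: evicted-from-recent pages we still remember */
    int ghost_frequent;  /* |B2| */
} ArcState;

static int imax(int a, int b) { return a > b ? a : b; }
static int imin(int a, int b) { return a < b ? a : b; }

/*
 * A miss that hits the ghost recent list: this block would have been a hit
 * had the recent list been bigger, so grow the recent target at the expense
 * of the frequent target.  (The block itself is promoted to the head of the
 * frequent list, as described above.)
 */
static void
ghost_recent_hit(ArcState *s)
{
    int delta = imax(1, s->ghost_frequent / imax(1, s->ghost_recent));

    s->target_recent = imin(s->capacity, s->target_recent + delta);
}

/* The symmetric case: we wrote off a frequent block too eagerly. */
static void
ghost_frequent_hit(ArcState *s)
{
    int delta = imax(1, s->ghost_recent / imax(1, s->ghost_frequent));

    s->target_recent = imax(0, s->target_recent - delta);
}

/* Which real list should give up a buffer when we need to evict? */
static const char *
choose_victim_list(const ArcState *s)
{
    if (s->recent >= imax(1, s->target_recent))
        return "recent";
    return "frequent";
}

int
main(void)
{
    ArcState s = {100, 50, 60, 40, 30, 30};

    /*
     * A run of would-be hits on recently evicted "recent" blocks nudges
     * the target back toward recency, so the policy can't get stuck.
     */
    for (int i = 0; i < 5; i++)
        ghost_recent_hit(&s);
    printf("target_recent = %d, evict from: %s\n",
           s.target_recent, choose_victim_list(&s));
    return 0;
}

The key property is visible even at this level: ghost hits steer the
target, and the victim choice follows the target, so a workload shift
gets corrected by observed would-be hits rather than by a hard-coded
preference for one kind of block.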

> Every replacement strategy which doesn't care about the kind of block
> that's being worked with will end up performing random replacement under
> enough pressure. We don't actually have to allow that to happen though
> since we do have information about the kind of block and therefore it
> seems like we should be able to keep things mediocre, and avoid letting
> things get bad under such pressure.

If you have extreme cache pressure, I don't think that there is
anything left to worry about other than the scalability of evicting
buffers. That should be a rare enough occurrence for well-maintained
systems these days (again, this may only actually be true as an
aspiration). Just look at the ARC paper's statistics; they report that
they're not much better than the competition (including even a classic
LRU) when there is either mild cache pressure or extreme cache
pressure. The middle ground is the interesting part.

All of that having been said, maybe we have an independent low-level
problem: we increase usage_count when we pin a buffer, even if we last
pinned the buffer 5ms ago. IIRC a change to this went in when ARC went
in (and came out with ARC, too). This is more of a problem with
miscounting the number of hits based on accidents around how things
are structured in the executor -- we need to be more careful about
recognizing whether or not block hits are logically distinct,
particularly with stuff like nested loop joins.
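Purely to illustrate what "logically distinct" might mean, and not as a
proposal for how to actually do it: imagine suppressing repeat
usage_count bumps for pins that arrive within a short window of the
last counted one. The window and field names are made up; a real
approach would probably key on something smarter than wall-clock time:

#include <stdio.h>
#include <stdint.h>

#define MAX_USAGE_COUNT 5
#define HIT_DEBOUNCE_US 5000    /* assumed 5ms window, purely illustrative */

typedef struct Buf
{
    int      usage_count;
    uint64_t last_counted_pin_us;   /* time of the last pin we counted */
} Buf;

/*
 * Only count a pin as a logically distinct hit if some time has passed
 * since the last counted pin; a nested loop re-pinning the same inner
 * index page a thousand times in a millisecond then counts as one hit,
 * not a thousand.
 */
static void
pin_buffer(Buf *b, uint64_t now_us)
{
    if (b->usage_count == 0 ||
        now_us - b->last_counted_pin_us >= HIT_DEBOUNCE_US)
    {
        if (b->usage_count < MAX_USAGE_COUNT)
            b->usage_count++;
        b->last_counted_pin_us = now_us;
    }
}

int
main(void)
{
    Buf inner_index_page = {0, 0};

    /* 1,000 pins within one millisecond: counted as a single hit. */
    for (uint64_t t = 0; t < 1000; t++)
        pin_buffer(&inner_index_page, t);

    printf("usage_count after burst = %d\n", inner_index_page.usage_count);
    return 0;
}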

--
Peter Geoghegan
