Re: [HACKERS] Clock with Adaptive Replacement

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Clock with Adaptive Replacement
Date: 2018-04-28 14:39:54
Message-ID: 20180428143954.GP27724@tamriel.snowman.net

Greetings,

* Peter Geoghegan (pg(at)bowt(dot)ie) wrote:
> On Thu, Apr 26, 2018 at 10:39 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > I'm personally not very excited about making rules like "index pages
> > are more valuable than heap pages". Such rules will in many cases be
> > true, but it's easy to come up with cases where they don't hold: for
> > example, we might run pgbench for a while and then stop running
> > pgbench and start running big sequential scans for reporting purposes.
>
> I am not in favor of this general approach. Actually, I'm willing to
> come out against it strongly right now: let's not teach the buffer
> manager to distinguish between blocks based on the AM they belong to.

Perhaps I'm misunderstanding the case here, but with the ring buffer we
use for sequential access, aren't we already discouraging keeping heap
pages around, particularly when they're coming from a sequential scan?

I'm not suggesting that's a bad thing either; in fact, it seems like a
generally good thing, and I don't think we get many complaints about it.
I'm just trying to point out that we already have a distinction between
heap-pages-from-seq-scans and index pages (and heap pages fetched by
using those indexes), and therefore I'm not convinced by an argument
that we shouldn't ever treat pages from the heap differently than pages
from the indexes.
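
For anyone following along, a simplified model of what that ring does:
a large seq scan allocates from a small private ring and then keeps
recycling those same buffers, so it can't flush the rest of
shared_buffers. This is only a sketch, not the real BAS_BULKREAD code
in freelist.c, and get_free_buffer() is a hypothetical stand-in for
the normal allocation path:

#define RING_SIZE 32                /* BAS_BULKREAD uses 256 kB / 8 kB pages */

typedef struct ScanRing
{
    int     buffers[RING_SIZE];     /* buffer ids owned by this scan */
    int     next;                   /* next slot to hand out */
    int     filled;                 /* slots populated so far */
} ScanRing;

/* Hypothetical stand-in for the shared pool's normal allocation path. */
static int
get_free_buffer(void)
{
    static int  next_free = 0;

    return next_free++;
}

/*
 * Return the buffer to use for the scan's next page.  Until the ring
 * fills we take fresh buffers from the shared pool; after that the
 * scan just keeps overwriting its own ring, leaving the rest of the
 * cache untouched.
 */
int
ring_get_buffer(ScanRing *ring)
{
    int     slot = ring->next;

    ring->next = (ring->next + 1) % RING_SIZE;
    if (ring->filled < RING_SIZE)
    {
        ring->buffers[slot] = get_free_buffer();
        ring->filled++;
    }
    return ring->buffers[slot];
}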

This could change in an environment where we aren't trying to take
advantage of the kernel's buffer cache and things like its read-ahead,
and maybe that's what you're getting at here. Even so, we have more
inherent understanding of how our pages are likely to be accessed than
the kernel does, and we shouldn't be explicitly trying to avoid taking
advantage of that knowledge.

> > We don't want to artificially pin the index buffers in shared_buffers
> > just because they're index pages; we want to figure out which pages
> > really matter. Now, I realize that you weren't proposing (and
> > wouldn't propose) a rule that index pages never get evicted, but I
> > think that favoring index pages even in some milder way is basically a
> > hack. Index pages aren't *intrinsically* more valuable; they are more
> > valuable because they will, in many workloads, be accessed more often
> > than heap pages. A good algorithm ought to be able to figure that out
> > based on the access pattern, without being explicitly given a hint, I
> > think.
>
> I won't argue against any of that, but I think that it's rather
> nuanced, and that the nuance probably matters a lot.
>
> First of all, we must have a sense of proportion about what is likely
> to be true in the average case. I mentioned to Thomas that the pgbench
> accounts table is 6x the size of its primary key, and that with a uniform
> distribution + point lookups the leaf pages literally have 6x the
> utility of the heap pages. It really is that simple there. Also,
> whatever the distribution of the point lookups is, "cache more leaf
> pages" is probably going to be the effect that a better caching
> strategy would have. Even 6x is probably an underestimate of the
> utility of leaf pages compared to heap pages in the average case. I
> bet it's more like 10x for a typical OLTP app.

I agree that we should definitely be thinking about data density here,
and that's another case where we have more information about the data
than the kernel does (I keep comparing to the kernel because it also
tries to predict a page's value, just with less information to go on).
Another consideration here would be internal pages rather than just
leaf pages: those have an even higher likelihood of being needed again
quickly, since they're traversed far more often than leaf pages are.
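
To make the density argument concrete, here's a back-of-envelope
calculation along the lines Peter describes. The tuple sizes below are
rough assumptions of mine, not measured values:

#include <stdio.h>

int
main(void)
{
    const int   page_bytes = 8192;      /* BLCKSZ */
    const int   heap_tuple_bytes = 128; /* ~97 bytes of pgbench_accounts data
                                         * plus header and line pointer */
    const int   leaf_entry_bytes = 20;  /* 4-byte key, alignment padding,
                                         * index tuple header, line pointer */

    int     heap_per_page = page_bytes / heap_tuple_bytes;
    int     leaf_per_page = (page_bytes * 90 / 100) / leaf_entry_bytes;
                                        /* btree default fillfactor is 90 */

    /*
     * Under uniform point lookups, a page's expected hit rate is
     * proportional to the number of tuples it covers.
     */
    printf("heap tuples/page:  %d\n", heap_per_page);
    printf("leaf entries/page: %d\n", leaf_per_page);
    printf("leaf:heap utility: %.1fx\n",
           (double) leaf_per_page / heap_per_page);
    return 0;
}

With these numbers it prints roughly a 5.8x ratio, which lines up with
the 6x figure above.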

> Second of all, we must have a sense of perspective about what we're
> trying to do here, which is to predict the future based on the past.
> The best outcome we can hope for is to lose less on average without
> hurting anything that already works well enough.

Agreed.

> > I believe the root of the problem here is that the usage count we have
> > today does a very poor job distinguishing what's hot from what's not.
>
> I think that the root of the problem is that we don't remember
> anything about evicted buffers. The same block can bounce in and out
> of shared_buffers repeatedly, and we don't recognize that. There are
> probably second-order effects from sizing shared_buffers. We
> needlessly tie the number of buffers to our capacity to remember
> things about block popularity. That would be bad if we had no FS
> cache, but with the FS cache it seems awful.

This is definitely an interesting and valuable point. Having pages
bounce in and out of shared buffers means that we aren't properly
tracking their value. Just brainstorming, but I wonder if there's a way
we could keep tracking their value even after we've evicted them,
without slowing down the entire system. I don't have any great ideas
off-hand, but something that can distinguish "we last accessed this
block an hour ago" from "we last accessed this block half a millisecond
ago" might help avoid such ping-ponging. I wonder if there are known
strategies for addressing this, perhaps a data structure which
effectively tracks hit rates for all potential blocks...
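
For what it's worth, ARC and CAR do something along these lines: they
keep "ghost lists" that remember the identities (not the contents) of
recently evicted pages. A minimal sketch of the idea, using entirely
hypothetical names and ignoring the locking a real implementation
would need:

#include <stdint.h>

#define GHOST_SLOTS 4096            /* fixed memory, independent of shared_buffers */

typedef struct GhostEntry
{
    uint64_t    tag;                /* identifies (relation, fork, block) */
    uint64_t    evicted_at;         /* coarse timestamp or tick counter */
} GhostEntry;

static GhostEntry ghost[GHOST_SLOTS];

static unsigned
ghost_slot(uint64_t tag)
{
    return (unsigned) ((tag * 0x9E3779B97F4A7C15ULL) >> 32) & (GHOST_SLOTS - 1);
}

/* Record an eviction; collisions simply overwrite, since it's only a hint. */
void
ghost_remember(uint64_t tag, uint64_t now)
{
    GhostEntry *e = &ghost[ghost_slot(tag)];

    e->tag = tag;
    e->evicted_at = now;
}

/*
 * On a buffer miss: did we evict this block recently?  If so, the
 * caller could seed its usage_count above the usual starting value.
 */
int
ghost_recently_evicted(uint64_t tag, uint64_t now, uint64_t window)
{
    GhostEntry *e = &ghost[ghost_slot(tag)];

    return e->tag == tag && now - e->evicted_at <= window;
}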

> I share the intuition that we need something adaptive, that can be
> analysed and understood relatively easily, and has little to no magic.
> That's why I'm willing to say now that we shouldn't do anything based
> on the type of the blocks involved. However, I think that that model
> might work about as well, and that this matters because it provides
> motivating examples. Surely you're right when you say that index
> blocks are not intrinsically more useful than any other type of block,
> but if they're 10x more useful on average, and 20x more useful at
> other times, then a person could be forgiven for modelling them as
> intrinsically more useful.

If we know that they're, on average, 10x-20x more useful, then ignoring
that means throwing away information that we effectively have about the
future.

> It's also useful to examine why a random replacement strategy is
> merely mediocre to bad, and not abysmal. That's what every replacement
> strategy ends up being with enough cache pressure anyway. This makes
> it more reasonable to focus on the average case than it is in other
> places.

Every replacement strategy which doesn't care about the kind of block
it's working with will end up performing random replacement under
enough pressure. We don't actually have to allow that to happen,
though: since we do have information about the kind of block, it seems
like we should be able to keep things merely mediocre under such
pressure, and avoid letting them get outright bad.

> > There have been previous experiments around making usage_count use
> > some kind of a log scale: we make the maximum, say, 32, and the clock
> > hand divides by 2 instead of subtracting 1. I don't think those
> > experiments were enormously successful and I suspect that a big part
> > of the reason is that it's still pretty easy to get to a state where
> > the counters are maxed out for a large number of buffers, and at that
> > point you can't distinguish between those buffers any more: they all
> > look equally hot. We need something better. If a system like this is
> > working properly, things like interior index pages and visibility map
> > pages ought to show up as fiery hot on workloads where the index or
> > visibility map, as the case may be, is heavily used.
>
> I don't think that internal index pages and the visibility map are the
> problem, since there are so few of them, and they are accessed at least
> hundreds of times more frequently than the leaf pages.

Yet under sufficient pressure they still end up getting evicted by a
random replacement strategy, because we don't distinguish them from any
other buffer, and that leads to high latency on queries which should
always be extremely fast. As long as we're talking about a system which
can devolve into random replacement, we're going to have those cases
and we aren't going to have any answer for them. I'm not really
thrilled with any approach that allows that kind of degradation.
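
For reference, the log-scale usage_count experiment Robert describes
above would look something like this toy clock sweep. A sketch only;
the real logic lives in src/backend/storage/buffer/freelist.c and uses
atomics and buffer pins, none of which is modelled here:

#define NBUFFERS    16384
#define MAX_USAGE   32

static unsigned char usage_count[NBUFFERS];
static int  clock_hand = 0;

/* Called on every buffer hit: saturate at the (higher) maximum. */
void
buffer_touched(int buf)
{
    if (usage_count[buf] < MAX_USAGE)
        usage_count[buf]++;
}

/* Sweep until we find a victim whose count has decayed to zero. */
int
clock_sweep_victim(void)
{
    for (;;)
    {
        int     buf = clock_hand;

        clock_hand = (clock_hand + 1) % NBUFFERS;
        if (usage_count[buf] == 0)
            return buf;             /* evict this one */
        usage_count[buf] /= 2;      /* halve instead of subtracting 1 */
    }
}

The failure mode Robert describes falls out directly: once enough
buffers saturate at MAX_USAGE, the sweep sees a sea of identical
counts and can no longer tell the genuinely hot pages from the merely
warm ones.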

Thanks!

Stephen
