On Sun, Feb 15, 2009 at 1:16 PM, Greg Smith <gsmith(at)gregsmith(dot)com> wrote:
> On Fri, 13 Feb 2009, Robert Haas wrote:
>> Gather statistics on relation access patterns and use that to estimate the
>> fraction of a relation likely to be in cache.
> At one point I had a hacked background writer that collected statistics
> about the contents of the buffer cache. Since it's obtaining a lock on the
> buffer header anyway, it's a perfectly good place to note what relfileid the
> buffer is associated with. If you set aside some fixed amount of space to
> hold information about the most popular relations (perhaps using a
> continuous top-k model, see
> http://www.mysmu.edu/faculty/kyriakos/topk-SIGMOD06.pdf ), you can end up
> with enough data to estimate how much data in shared_buffers exists for the
> most cached relations in there.
> In a typical recommended tuning nowadays, we can only expect that to sample
> about 1/3 of the total caching happening (presuming shared_buffers=1/4 RAM
> and effective_cache_size~=3/4 RAM). While in general it's nice to think
> that shared_buffers has a similar makeup to what the OS is caching, it's not
> hard to discover common cases where this would not be the case.
> Particularly given the VACUUM/seq scan ring-buffer improvements in 8.3,
> it's easy to imagine scanning a table that's 2*shared_buffers in size
> showing only 256KB in shared_buffers, while the whole thing is available in
> the OS cache.
> I had a eureka moment where I realized I could hook the buffer eviction code
> to model that. Decrement the count for that relation in the main top-k
> count, then have a second count that assumes the last 2*shared_buffers
> evicted are also still cached. That would accurately model the ring-buffer
> case and improve the quality of the model in general. Updating those stats
> on every eviction would add some overhead, but if the background writer is
> doing enough of them for you that should at least be asynchronous from when
> most backends are blocked waiting for an eviction.
> And that's as far as I got before I had to return to real work again.
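To make sure I follow the scheme: a toy model of it might look like the sketch below — a per-relation count of buffers currently in shared_buffers, plus a window of the last 2*shared_buffers evictions assumed to still be in the OS cache. (This is purely illustrative Python; the names and structure are invented, not actual bufmgr internals.)

```python
from collections import Counter, deque

class CacheModel:
    """Toy model: count buffers per relation held in shared_buffers, and
    assume the last 2*shared_buffers evicted pages remain in the OS cache."""

    def __init__(self, shared_buffers):
        self.in_buffers = Counter()                       # rel -> live buffer count
        self.recent_evictions = deque(maxlen=2 * shared_buffers)

    def note_buffer(self, rel):
        # Called when the bgwriter sees a buffer belonging to rel.
        self.in_buffers[rel] += 1

    def note_eviction(self, rel):
        # Hook at eviction: decrement the live count, but remember the page
        # as probably still cached by the OS for a while.
        if self.in_buffers[rel] > 0:
            self.in_buffers[rel] -= 1
        self.recent_evictions.append(rel)

    def estimated_cached(self, rel):
        # Pages believed cached somewhere: live buffers + recent evictions.
        return self.in_buffers[rel] + sum(
            1 for r in self.recent_evictions if r == rel)
```

That would capture the ring-buffer seq-scan case: the live count for the table stays tiny, but the eviction window credits the recently scanned pages as OS-cached.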
This seems plausible, but I'm not totally sold: predicting the
contents of the operating system buffer cache sounds like it might be
pretty tough. And do we even need to go that far? I'm kind of
wondering whether we might be able to leverage the information that
the statistics collector already gathers for this purpose - in
particular, the information on blocks fetched and read. That might
not exactly model the current contents of the buffer cache, but it's
certainly a measure of popularity, and that may be all we really need.
We're not going to invalidate every plan in the system on every
buffer eviction, so plans have to be based not so much on what is in
the buffer cache right now but on what we have a reasonable
expectation of finding there in the typical case.
Consider, for example, the degenerate (but not necessarily uncommon)
case where the entire database can fit within shared_buffers, or
perhaps shared_buffers + OS cache. ISTM we're going to want to plan
as if the entire database is in cache all the time, even though that
might not always be true - right after restart, for example.
I kind of feel like the algorithm for predicting the cache hit ratio
ought to be some sort of ANALYZE-like process that runs periodically
and stores away a value for each table. When we run the algorithm, we
take a user-supplied estimate of the total cache available on the
machine (effective_cache_size, or some new GUC we invent for this
purpose) and divvy it up among all the relations in the database in
proportion to popularity score. That is, we divide the number of
popularity points for the most popular relation by the total of all
popularity values and assign that amount of cache to the most popular
relation (but not more than the relation size). We then repeat this
algorithm with the remaining relations and the remaining amount of
cache, and then approximate the cache hit ratio for each relation as
the allocated amount of cache divided by the relation size. To
prevent plans from changing too abruptly, the popularity function
should use some sort of sliding window.
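In pseudocode, the divvying-up step I have in mind would go roughly like this (a Python sketch with invented names; sizes and cache are in pages, popularity is whatever score the stats give us):

```python
def estimate_hit_ratios(relations, total_cache):
    """relations: list of (name, size_in_pages, popularity).
    Returns a dict mapping name -> estimated cache hit ratio."""
    # Walk relations from most to least popular.
    remaining = sorted(relations, key=lambda r: r[2], reverse=True)
    cache_left = total_cache
    ratios = {}
    while remaining:
        total_pop = sum(pop for _, _, pop in remaining)
        name, size, pop = remaining.pop(0)
        if total_pop == 0 or cache_left <= 0 or size == 0:
            ratios[name] = 0.0
            continue
        # Proportional share of what's left, capped at the relation's size.
        share = cache_left * pop / total_pop
        allocated = min(share, size)
        ratios[name] = allocated / size
        cache_left -= allocated
    return ratios
```

So a small, hot relation ends up with a hit ratio near 1.0, and whatever cache it can't absorb spills over to the colder ones on later iterations.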
Where you're going to run into trouble with any algorithm of this type
is databases that switch between multiple, distinct workloads. I have
no idea what to do about that problem. You might also run into
problems with relations that have "hot" segments that are accessed
frequently and stay cached, and "cold" segments that are never
touched: if 20% of the relation is in cache, but that's the only 20%
of the relation we ever access, then our hit rate will be 100% rather than 20%.
But even a primitive algorithm would probably be a lot better than
what we have now. I'm guessing that there are a lot of databases where
either the whole database fits in cache, or a decent chunk of
relatively small core relations fit in cache and then there are some
big or infrequently-used ones that don't.