Re: [HACKERS] Clock with Adaptive Replacement

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Clock with Adaptive Replacement
Date: 2018-05-02 16:27:19
Message-ID: CA+TgmoafeZ0FvnGB7QOK3TDkkQWwpJY7PDpbc54VaDfjX0x1gQ@mail.gmail.com

On Tue, May 1, 2018 at 6:37 PM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> This seems to be an old idea.

I'm not too surprised ... this area has been well-studied.

> I've always had a problem with the 8GB/16GB upper limit on the size of
> shared_buffers. Not because it's wrong, but because it's not something
> that has ever been explained. I strongly suspect that it has something
> to do with usage_count saturation, since it isn't reproducible with
> any synthetic workload that I'm aware of. Quite possibly because there
> are few bursty benchmarks.

I've seen customers have very good luck going higher if it lets all the
data fit in shared_buffers, or at least all the data that is accessed
with any frequency. I think it's useful to imagine a series of
concentric working sets - maybe you have 1GB of the hottest data, 3GB
of data that is at least fairly hot, 10GB of data that is at least
somewhat hot, and another 200GB of basically cold data. Increasing
shared_buffers in a way that doesn't let the next "ring" fit in
shared_buffers isn't likely to help very much. If you have 8GB of
shared_buffers on this workload, going to 12GB is probably going to
help -- that should be enough for the 10GB of somewhat-hot stuff and a
little extra so that the somewhat-hot stuff doesn't immediately start
getting evicted if some of the cold data is accessed. Similarly,
going from 2GB to 4GB should be a big help, because now the fairly-hot
stuff should stay in cache. But going from 4GB to 6GB or 12GB to 16GB
may not do very much. It may even hurt, because the duplication
between shared_buffers and the OS page cache means an overall
reduction in available cache space. If for example you've got 16GB of
memory and shared_buffers=2GB, you *may* be fitting all of the
somewhat-hot data into cache someplace; bumping shared_buffers=4GB
almost certainly means that will no longer happen, causing performance
to tank.
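
Just to make that arithmetic concrete, here's a throwaway sketch of the
model I have in mind. Every number in it is invented for illustration (a
16GB box, 3GB assumed for the kernel, backends, and everything else that
isn't cache, cumulative working sets of 1/3/10/210GB), and it
pessimistically assumes that everything in shared_buffers is also
duplicated in the OS page cache:

/*
 * Toy model of the working-set arithmetic above.  All numbers are made
 * up for illustration, and it assumes worst-case double buffering:
 * every page in shared_buffers also occupies OS cache, so the amount of
 * distinct cached data is max(shared_buffers, OS cache).
 */
#include <stdio.h>

int
main(void)
{
	double		ram = 16.0;		/* GB of physical memory (assumed) */
	double		reserved = 3.0;	/* kernel, backends, WAL, etc. (assumed) */
	double		ring[] = {1.0, 3.0, 10.0, 210.0};	/* cumulative working sets */
	double		sb_try[] = {2.0, 4.0, 8.0, 12.0};

	for (int i = 0; i < 4; i++)
	{
		double		sb = sb_try[i];
		double		os_cache = ram - reserved - sb;
		double		distinct = (sb > os_cache) ? sb : os_cache;

		printf("shared_buffers=%2.0fGB: ~%4.1fGB of distinct data cached ->",
			   sb, distinct);
		for (int r = 0; r < 4; r++)
			printf(" ring%d %s", r + 1,
				   ring[r] <= distinct ? "fits;" : "spills;");
		printf("\n");
	}
	return 0;
}

Under those (made-up) assumptions the output matches the story above:
the 10GB somewhat-hot ring fits at shared_buffers=2GB via the OS cache,
stops fitting at 4GB and 8GB, and fits again at 12GB.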

I don't really think that the 8GB rule of thumb is something that
originates in any technical limitation of PostgreSQL or Linux. First
of all, it's just a rule of thumb -- the best value in a given
installation can easily be something completely different. Second, to
the extent that it's a useful rule of thumb, I think it's really a
guess about what people's working set looks like: that going from 4GB
to 8GB, say, significantly increases the chances of fitting the
next-larger, next-cooler working set entirely in shared_buffers, going
from 8GB to 16GB is less likely to accomplish this, and going from
16GB to 32GB probably won't. To a lesser extent, it's reflective of
the point where scanning shared buffers to process relation drops gets
painful, and the point where an immediate checkpoint suddenly dumping
that much data out to the OS all at once starts to overwhelm the I/O
subsystem for a significant period of time. But I think those really
are lesser effects. My guess is that the big effect is balancing
increased hit ratio vs. increased double buffering.

> I agree that wall-clock time is a bad approach, actually. If we were
> to use wall-clock time, we'd only do so because it can be used to
> discriminate against a thing that we actually care about in an
> approximate, indirect way. If there is a more direct way of
> identifying correlated accesses, which I gather that there is, then we
> should probably use it.

For a start, I think it would be cool if somebody just gathered traces
for some simple cases. For example, consider a pgbench transaction.
If somebody produced a trace showing the buffer lookups in order
annotated as heap, index leaf, index root, VM page, FSM root page, or
whatever. Examining some other simple, common cases would probably
help us understand whether it's normal to bump the usage count more
than once per buffer for a single scan, and if so, exactly why that
happens. If the code knows that it's accessing the same buffer a
second (or subsequent) time, it could pass down a flag saying not to
bump the usage count again.
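
To illustrate the kind of thing I mean, here's a standalone toy
simulation. The per-transaction access pattern in it is completely made
up (gathering a real trace is the whole point), and the once-per-scan
rule stands in for the hypothetical flag; it just shows how counting
each buffer at most once per scan changes how quickly usage counts climb
toward the cap:

/*
 * Hypothetical, standalone toy: compares "bump usage_count on every
 * access" against "bump at most once per buffer per scan".  The access
 * sequence is invented, not a real pgbench trace.
 */
#include <stdio.h>
#include <string.h>

#define NBUFFERS		5
#define MAX_USAGE_COUNT	5		/* same cap as BM_MAX_USAGE_COUNT */

static const char *bufname[NBUFFERS] = {
	"index root", "index inner", "index leaf", "heap", "vm"
};

/* imagined per-transaction access sequence; buffer 0 is touched twice */
static const int access_seq[] = {0, 1, 2, 3, 0, 4};
#define NACCESSES ((int) (sizeof(access_seq) / sizeof(access_seq[0])))

static void
run(int once_per_scan, int ntransactions)
{
	int			usage[NBUFFERS] = {0};
	int			seen[NBUFFERS];

	for (int t = 0; t < ntransactions; t++)
	{
		memset(seen, 0, sizeof(seen));
		for (int a = 0; a < NACCESSES; a++)
		{
			int			b = access_seq[a];

			if (once_per_scan && seen[b])
				continue;		/* the hypothetical "don't bump again" flag */
			seen[b] = 1;
			if (usage[b] < MAX_USAGE_COUNT)
				usage[b]++;
		}
	}

	printf("%s after %d transactions:",
		   once_per_scan ? "bump once per scan" : "bump every access ",
		   ntransactions);
	for (int b = 0; b < NBUFFERS; b++)
		printf("  %s=%d", bufname[b], usage[b]);
	printf("\n");
}

int
main(void)
{
	run(0, 3);
	run(1, 3);
	return 0;
}

With the made-up pattern above, the twice-touched root page races ahead
of everything else under today's rule but stays level with the other
buffers under the once-per-scan rule, which is the sort of difference a
real trace would let us quantify.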

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
