From: "ben(dot)manes" <ben(dot)manes(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Clock with Adaptive Replacement
Date: 2018-05-11 16:57:34
Message-ID: 1526057854777-0.post@n3.nabble.com
Lists: pgsql-hackers

I have been working on caching as a hobby project for the last few years and
would be happy to collaborate on an analysis and design. I've tackled a lot
of these issues, written widely used implementations, and co-authored a paper
for ACM Transactions on Storage on an eviction policy. A short summary of my
results appears in this HighScalability article [1].

I very much agree with the requests to evaluate against real-world traces. I
wrote a simulator that supports a large number of policies and traces [2].
It was invaluable. I believe ARC's DS1 should be the closest to a Postgres
workload. Their OLTP trace appears to be the I/O of the WAL, making it
recency-biased and not very interesting.

** Eviction **
ARC-based policies (CAR, CART) are only modestly better than LRU in most
workloads. They do much better on Zipf distributions, but poorly on looping
workloads like glimpse. I think the problem is that they do not capture and
use the history (ghost lists) very effectively.

LIRS-based policies (ClockPro, FRD) are very strong. Unfortunately the
papers do a poor job of describing them, they are complex to implement, and
most implementations are flawed in ways that diminish the hit rate. As
described in its paper, LIRS keeps an unbounded history that can cause memory
leaks if not capped, and the pruning is O(n).

I try to avoid Clock-based policies in my designs due to their O(n) scans.
These can cause latency spikes that are difficult to reproduce because the
scan length depends on the policy's state. Random sampling policies don't
have this problem, but tend to underperform due to a lack of history. I
prefer amortized O(1) policies, but those are difficult to make concurrent
(solved below).
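
To illustrate the latency concern, here is a minimal, hypothetical sketch of
a second-chance scan in Java (not Postgres's clock sweep; the names are
illustrative). The victim search clears reference bits as it goes, so a
single eviction can touch every slot when the working set is hot:

final class ClockSketch {
  static final class Entry {
    boolean referenced;   // set on access, cleared by the sweep
    Object value;
  }

  private final Entry[] slots;
  private int hand;

  ClockSketch(int capacity) {
    slots = new Entry[capacity];
    for (int i = 0; i < capacity; i++) {
      slots[i] = new Entry();
    }
  }

  /** Returns the victim's slot index; a single call is O(n) in the worst case. */
  int evict() {
    for (;;) {
      Entry e = slots[hand];
      if (e.referenced) {
        e.referenced = false;               // give it a second chance
        hand = (hand + 1) % slots.length;   // keep sweeping
      } else {
        return hand;                        // victim found; caller replaces it
      }
    }
  }
}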

I chose TinyLFU with an admission window, called W-TinyLFU. This uses a
frequency sketch with saturating counters for its history (e.g., a CountMin
sketch with 4-bit counters). The counters are halved every sample period
(e.g., 10x the cache's maximum size, in accesses), which efficiently ages the
LFU. Instead of trying to maintain an eviction order, TinyLFU avoids cache
pollution by rejecting candidates with a low probability of reuse. This is
done by comparing the candidate's frequency against the eviction policy's
victim, and admitting the candidate only if it offers a better hit rate. The
LRU admission window helps on recency-biased traces, where a candidate would
otherwise be rejected prematurely and, even when accepted, is unlikely to be
used again soon. This simple and compact scheme has the highest hit rate in
most real-world workloads that I have tested. To bypass the ACM paywall, you
can use the author link on my github project's README [3].
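
In (simplified, hypothetical) Java the idea looks roughly like this; the
sizing and names are illustrative and this is not Caffeine's actual code:

import java.util.concurrent.ThreadLocalRandom;

final class TinyLfuSketch {
  private static final int DEPTH = 4;            // hash rows
  private final long[] seeds = new long[DEPTH];
  private final byte[][] table;                  // 4-bit logical counters, one per byte for simplicity
  private final int sampleSize;                  // e.g. 10x the cache's maximum size
  private int additions;

  TinyLfuSketch(int maximumSize) {
    int width = Math.max(8, Integer.highestOneBit(maximumSize) * 2);
    table = new byte[DEPTH][width];
    sampleSize = 10 * maximumSize;
    ThreadLocalRandom random = ThreadLocalRandom.current();
    for (int i = 0; i < DEPTH; i++) {
      seeds[i] = random.nextLong() | 1;
    }
  }

  /** Records an access, aging all counters when the sample period elapses. */
  void increment(Object key) {
    for (int i = 0; i < DEPTH; i++) {
      int slot = index(key, i);
      if (table[i][slot] < 15) {                 // saturate at the 4-bit maximum
        table[i][slot]++;
      }
    }
    if (++additions >= sampleSize) {
      reset();
    }
  }

  /** Estimated frequency: the minimum over the counter rows. */
  int frequency(Object key) {
    int min = 15;
    for (int i = 0; i < DEPTH; i++) {
      min = Math.min(min, table[i][index(key, i)]);
    }
    return min;
  }

  /** Admit the candidate only if it is estimated to offer a better hit rate than the victim. */
  boolean admit(Object candidate, Object victim) {
    return frequency(candidate) > frequency(victim);
  }

  private void reset() {
    for (byte[] row : table) {
      for (int i = 0; i < row.length; i++) {
        row[i] = (byte) (row[i] >>> 1);          // halve every counter to age the LFU
      }
    }
    additions /= 2;
  }

  private int index(Object key, int row) {
    long hash = key.hashCode() * seeds[row];
    hash ^= (hash >>> 32);
    return (int) (hash & (table[row].length - 1));
  }
}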

We are currently working on making W-TinyLFU adaptive by dynamically
resizing the admission window. I can send the draft paper privately if anyone is
interested. I am a Bay Area engineer working with researchers at Israel's
Technion, so the mixture has been fruitful.

** Concurrency **
As you know, LRU's O(1) nature is very attractive but the list manipulations
are not friendly for multi-threading. The lock hold times are small, but the
access rate is high and causes a lot of contention.

The solution is to use the WAL approach of recording the event into a buffer
and replaying it under a tryLock. This way, adding to the ring buffer is a
cheap CAS and threads won't block, since their work has been handed off. When
the buffers fill up they are replayed under the lock in a batch, thereby
isolating the policy from concurrent mutations. I use multiple, lossy ring
buffers to capture reads, and replaying is delayed until it is beneficial.
For writes, I use a ring buffer that is replayed immediately and that applies
back pressure when full (though it rarely fills).
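
A simplified, hypothetical sketch of the read-buffer side in Java (the real
implementation is more careful, but the shape is the same): a CAS append that
drops events when full, and a tryLock drain that replays the batch into the
not-thread-safe policy:

import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Consumer;

final class ReadBuffer<E> {
  private static final int SIZE = 128;            // power of two
  private static final int MASK = SIZE - 1;

  private final AtomicReferenceArray<E> ring = new AtomicReferenceArray<>(SIZE);
  private final AtomicLong writeCounter = new AtomicLong();
  private final AtomicLong readCounter = new AtomicLong();
  private final ReentrantLock evictionLock = new ReentrantLock();

  /** Records an access; lossy, so a full buffer just drops the event. */
  void record(E event) {
    long write = writeCounter.get();
    long read = readCounter.get();
    if (write - read >= SIZE) {
      return;                                     // full: drop rather than block
    }
    if (writeCounter.compareAndSet(write, write + 1)) {
      ring.lazySet((int) (write & MASK), event);  // cheap publish of the claimed slot
    }
    // On CAS failure another thread claimed the slot; dropping is acceptable.
  }

  /** Replays buffered events against the policy if the lock is free. */
  void drainTo(Consumer<E> policy) {
    if (!evictionLock.tryLock()) {
      return;                                     // someone else is draining
    }
    try {
      long read = readCounter.get();
      long write = writeCounter.get();
      for (; read < write; read++) {
        E event = ring.get((int) (read & MASK));
        if (event == null) {
          break;                                  // slot claimed but not yet published
        }
        ring.lazySet((int) (read & MASK), null);
        policy.accept(event);                     // policy code need not be thread-safe
      }
      readCounter.lazySet(read);
    } finally {
      evictionLock.unlock();
    }
  }
}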

In my benchmarks [4] the overhead is quite small on a Zipf workload (to
simulate hot entries). The read throughput is 33% of a lock-free hash table's
and scales linearly. The write throughput is nearly the same due to
contention when updating the hot entries. The read throughput is high enough
to meet performance budgets, and its overhead is only observable in synthetic
micro-benchmarks.

The benefit of this approach is that, for a very small penalty to reads, the
policy is mostly isolated from the multi-threading. This allows efficient and
complex O(1) algorithms that are not thread-safe. For example, I use
W-TinyLFU (LRU+SLRU) for eviction and a hierarchical timer wheel for
expiration. While the initial infrastructure is more work, it's been a boon
due to composability, a simpler mental model, and not having to use
algorithms that degrade in difficult-to-debug ways.
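
As a rough illustration of how the pieces compose, the maintenance work can
be expressed as a single drain-and-run step under one tryLock; the interfaces
below are placeholders for illustration, not Caffeine's API:

import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Consumer;

final class Maintenance<K> {
  interface Buffer<E> { void drainTo(Consumer<E> action); }
  interface EvictionPolicy<T> { void recordAccess(T key); void recordWrite(T key); void evictIfNeeded(); }
  interface TimerWheel { void advance(long nowNanos); }

  private final ReentrantLock evictionLock = new ReentrantLock();
  private final Buffer<K> readBuffer;
  private final Buffer<K> writeBuffer;
  private final EvictionPolicy<K> policy;
  private final TimerWheel timerWheel;

  Maintenance(Buffer<K> readBuffer, Buffer<K> writeBuffer,
      EvictionPolicy<K> policy, TimerWheel timerWheel) {
    this.readBuffer = readBuffer;
    this.writeBuffer = writeBuffer;
    this.policy = policy;
    this.timerWheel = timerWheel;
  }

  /** Called opportunistically after reads and writes; skips if another thread is already at work. */
  void run() {
    if (!evictionLock.tryLock()) {
      return;                                    // someone else is doing the maintenance
    }
    try {
      readBuffer.drainTo(policy::recordAccess);  // replay buffered reads into the eviction policy
      writeBuffer.drainTo(policy::recordWrite);  // replay writes, which may change the cache's size
      timerWheel.advance(System.nanoTime());     // expire entries whose time has passed
      policy.evictIfNeeded();                    // evict through the admission filter
    } finally {
      evictionLock.unlock();
    }
  }
}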

Hope that helps,
Ben

[1] http://highscalability.com/blog/2016/1/25/design-of-a-modern-cache.html
[2] https://github.com/ben-manes/caffeine/wiki/Simulator
[3] https://github.com/ben-manes/caffeine
[4] https://github.com/ben-manes/caffeine/wiki/Benchmarks

--
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html
