Re: Clock sweep not caching enough B-Tree leaf pages?

From: Jim Nasby <jim(at)nasby(dot)net>
To: Peter Geoghegan <pg(at)heroku(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Clock sweep not caching enough B-Tree leaf pages?
Date: 2014-04-14 23:02:46
Message-ID: 534C6916.7090205@nasby.net
Lists: pgsql-hackers

On 4/14/14, 12:11 PM, Peter Geoghegan wrote:
> I have some theories about the PostgreSQL buffer manager/clock sweep.
> To motivate the reader to get through the material presented here, I
> present up-front a benchmark of a proof-of-concept patch of mine:
>
> http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/3-sec-delay/
>
> Test Set 4 represents the patch's performance here.
>
> This shows some considerable improvements for a tpc-b workload, with
> 15 minute runs, where the buffer manager struggles with moderately
> intense cache pressure. shared_buffers is 8GiB, with 32GiB of system
> memory in total. The scale factor is 5,000 here, so that puts the
> primary index of the accounts table at a size that makes it impossible
> to cache entirely within shared_buffers, by a margin of a couple of
> GiBs. pgbench_accounts_pkey is ~"10GB", and pgbench_accounts is ~"63
> GB". Obviously the heap is much larger, since for that table heap
> tuples are several times the size of index tuples (the ratio here is
> probably well below the mean, if I can be permitted to make a vast
> generalization).
>
> PostgreSQL implements a clock sweep algorithm, which gets us something
> approaching an LRU for the buffer manager in trade-off for less
> contention on core structures. Buffers have a usage_count/"popularity"
> that currently saturates at 5 (BM_MAX_USAGE_COUNT). The classic CLOCK
> algorithm only has one bit for what approximates our "usage_count" (so
> it's either 0 or 1). I think that at its core CLOCK is an algorithm
> that has some very desirable properties that I am sure must be
> preserved. Actually, I think it's more accurate to say we use a
> variant of CLOCK-Pro, a refinement of the original CLOCK.
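
For anyone skimming, the mechanism described above boils down to roughly the following. This is a deliberately simplified, self-contained sketch of the clock-sweep idea, not the actual bufmgr code; Buffer, pin_buffer and clock_sweep are illustrative names only:

#include <stdio.h>

#define NBUFFERS        8
#define MAX_USAGE_COUNT 5   /* analogue of BM_MAX_USAGE_COUNT */

typedef struct { int usage_count; } Buffer;

static Buffer buffers[NBUFFERS];
static int hand = 0;            /* the clock hand */

/* Bump a buffer's popularity on access, saturating at the cap. */
static void pin_buffer(int i)
{
    if (buffers[i].usage_count < MAX_USAGE_COUNT)
        buffers[i].usage_count++;
}

/* Sweep the hand forward, decrementing counts, until a buffer with a
 * usage_count of zero is found; that buffer is the eviction victim. */
static int clock_sweep(void)
{
    for (;;)
    {
        int candidate = hand;

        hand = (hand + 1) % NBUFFERS;
        if (buffers[candidate].usage_count == 0)
            return candidate;
        buffers[candidate].usage_count--;   /* second chance */
    }
}

int main(void)
{
    pin_buffer(2);
    pin_buffer(2);
    pin_buffer(5);
    printf("victim = %d\n", clock_sweep());  /* buffer 0, never pinned */
    return 0;
}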

I think it's important to mention that OS implementations (at least all I know of) have multiple page pools, each of which has its own clock. IIRC one of the arguments for us supporting a count > 1 was that we could get the benefits of multiple page pools without the overhead. In reality I believe that argument is false, because the clocks for each page pool in an OS *run at different rates* based on system demands.
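
To make that concrete, here is a purely hypothetical sketch (not OS code, and not anything in PostgreSQL; Pool and run_clock are made-up names) of what "each pool's clock runs at its own rate" means: a pool's hand only advances when that pool sees allocation demand, so a busy pool decays usage counts far faster than an idle one:

#include <stdio.h>

#define POOL_SIZE 8

/* Hypothetical per-pool clock state. */
typedef struct
{
    const char *name;
    int  usage_count[POOL_SIZE];
    int  hand;          /* this pool's own clock hand */
    long distance;      /* how far this hand has travelled */
} Pool;

/* Advance only this pool's hand until a zero-count buffer is found. */
static int run_clock(Pool *p)
{
    for (;;)
    {
        int b = p->hand;

        p->hand = (p->hand + 1) % POOL_SIZE;
        p->distance++;
        if (p->usage_count[b] == 0)
            return b;
        p->usage_count[b]--;
    }
}

int main(void)
{
    Pool hot  = { "hot",  {0}, 0, 0 };
    Pool cold = { "cold", {0}, 0, 0 };
    int  i;

    /* Heavy allocation demand on one pool drives its clock far ahead. */
    for (i = 0; i < 1000; i++)
        run_clock(&hot);
    run_clock(&cold);

    printf("%s hand travelled %ld, %s hand travelled %ld\n",
           hot.name, hot.distance, cold.name, cold.distance);
    return 0;
}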

I don't know if multiple buffer pools would be good or bad for Postgres, but I do think it's important to remember this difference any time we look at what OSes do.

> If you look at the test sets that this patch covers (with all the
> tricks applied), there are pretty good figures throughout. You can
> kind of see the pain towards the end, but there are no dramatic falls
> in responsiveness for minutes at a time. There are latency spikes, but
> they're *far* shorter, and much better hidden. Without looking at
> individual multiple minute spikes, at the macro level (all client
> counts for all runs) average latency is about half of what is seen on
> master.

My guess would be that those latency spikes are caused by a need to run the clock for an extended period. IIRC there's code floating around that makes it possible to measure that.
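
One cheap way to get at that (a sketch only; I'm not claiming this is the instrumentation I half-remember, and the names are made up) is to count how many buffers the hand examines per victim search and keep the worst case, then line that up against the latency spikes:

#include <stdio.h>

/* Hypothetical counters: track how many buffers each victim search
 * examines, plus the worst case, so long clock runs can be correlated
 * with latency spikes. */
static long total_examined = 0;
static long searches = 0;
static long worst_search = 0;

/* Call once per victim search with the number of buffers the hand touched. */
static void record_search(long examined)
{
    total_examined += examined;
    searches++;
    if (examined > worst_search)
        worst_search = examined;
}

static void report(void)
{
    printf("searches=%ld avg=%.1f worst=%ld\n",
           searches,
           searches ? (double) total_examined / searches : 0.0,
           worst_search);
}

int main(void)
{
    /* Fake numbers standing in for real victim searches. */
    record_search(3);
    record_search(120000);      /* an "extended" clock run */
    record_search(5);
    report();
    return 0;
}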

I suspect it would be very interesting to see what happens if your patch is combined with the work that (Greg?) did to reduce the odds of individual backends needing to run the clock. (I know part of that work looked at proactively keeping pages on the free list, but I think there was more to it than that).
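
As a rough illustration of that second idea (purely a sketch with made-up names; free_list, refill_free_list and run_clock_once are not the actual patch or bgwriter code): a background task keeps a small free list topped up, so foreground backends can usually just pop a buffer instead of running the clock themselves.

#include <stdio.h>

#define FREELIST_TARGET 4

/* Hypothetical free list of buffer ids. */
static int free_list[FREELIST_TARGET];
static int free_count = 0;
static int next_victim = 0;     /* stand-in for a real clock sweep */

/* Stand-in for one clock-sweep victim search. */
static int run_clock_once(void)
{
    return next_victim++;
}

/* Background task: keep the free list topped up to its target. */
static void refill_free_list(void)
{
    while (free_count < FREELIST_TARGET)
        free_list[free_count++] = run_clock_once();
}

/* Foreground backend: pop from the free list when possible, and only fall
 * back to running the clock itself when the list is empty. */
static int get_buffer(void)
{
    if (free_count > 0)
        return free_list[--free_count];
    return run_clock_once();
}

int main(void)
{
    refill_free_list();
    printf("backend got buffer %d without running the clock\n", get_buffer());
    return 0;
}
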
--
Jim C. Nasby, Data Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net
