Skip site navigation (1) Skip section navigation (2)

Re: hint bit cache v6

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: hint bit cache v6
Date: 2011-06-30 20:42:26
Message-ID: (view raw or flat)
Lists: pgsql-hackers
On Thu, Jun 30, 2011 at 11:44 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Jun 30, 2011 at 11:18 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>>> I think the basic problem is that the cache pages are so large.  It's
>>> hard to make them smaller because that increases the cost of accessing
>>> the cache, as you already noted, but at the same time, making an
>>> eviction decision on a cache that holds 64K entries based on 100 data
>>> points seems like waving around a pretty large-caliber weapon in a
>>> fairly uncontrolled fashion.  Maybe it works OK 90% of the time, but
>>> it's hard to avoid the niggling fear that someone might accidentally
>>> get shot.
>>'s essentially a rolling statistical sampling of transaction
>> IDs based on past acceses.  Going from say, 100 to 1000 will probably
>> drop your errror margin quite a bit but I really wonder how benefit
>> you can get from going past that.
> An interesting idea might be to forcibly cause a replacement every 100
> tuples (or every 1000 tuples) and see if that causes a noticeable
> performance regression.  If it doesn't, that's a good data point.

To test this I disabled the cache check on top of
HeapTupleSatisfiesMVCC and forced a cache entry in place of it with a
bogus xid (so every tuple scanned was treated as a 'miss'.  Scanning 1
million records in memory over and over went up a few percentage
points -- converging on about 280ms instead of 270ms.   This is with
bumped to 1000 miss array:

oprofile said:
regular hinted tuple case:
120      10.2041  advance_aggregates
107       9.0986  heapgettup_pagemode
77        6.5476  ExecProject
74        6.2925  heapgetpage
72        6.1224  ExecProcNode
72        6.1224  ExecScan
69        5.8673  advance_transition_function
66        5.6122  heap_getnext
58        4.9320  HeapTupleSatisfiesMVCC

hinted tuple with pathological cache entry:
111       8.9588  advance_aggregates
109       8.7974  heapgettup_pagemode
85        6.8604  ExecProject
81        6.5375  heapgetpage
77        6.2147  ExecScan
70        5.6497  ExecStoreTuple
70        5.6497  HeapTupleSatisfiesMVCC

this was a fairly short profiling run but the numbers are fairly
consistent.  the replacement is fairly cheap...sorting 1000 ints
doesn't take all that long. 100 is even faster.

> I think the sour spot for this whole idea is likely to be the case
> where you have a workload that is 90% read and 10% write, or something
> like that.  On an all-read workload (like pgbench -S), you're
> hopefully going to converge to a state where all the hint-bits are
> set, once VACUUM kicks in.  And on an all-write workload I think that
> you might lose the effect in the noise.  Not sure if we have an easy
> way to set up such a workload with pgbench, though.

it's trivial to test with pgbench -- just use the random number
facility to generate an id for some table and update where random() >
.9.   However this does not generate nearly enough 'misses' to
exercise cache replacement in any meaningful way.  My workstation vm
can punch out about 5k transactions/sec.  With only 10% getting a new
xid and writing back a tuple to the table only 500 xids are getting
generated/second.  At that rate it takes quite a while to burn through
the 256k transactions the cache ranges over.  Also this case is not
bound by scan performance anyways making profiling it a little silly
-- HeapTupleSatisfiesMVCC is simply not as big of a bottleneck in OLTP
workloads.  Scan heavy loads is where this matters, and scan heavy
loads naturally tend to generate less xids per tuple scanned.

>>> Just for the sake of argument, let's suppose we made an array with 64K
>>> elements, each one representing one 64K chunk of the XID space.  Each
>>> element is a 4-byte unsigned integer, so the array takes up 256kB of
>>> memory... probably not good, but I'm just thinking out loud here.
>>> Every time we're asked about an XID, we bump the corresponding
>>> counter.  Periodically (say, every 64K probes) we zip through all the
>>> counters and consider performing a cache replacement.  We look at the
>>> not-already-cached page that has the highest counter value and compare
>>> it to the counter value for the cached pages.  If it is higher and the
>>> difference exceeds some safety margin (to protect against hysteresis),
>>> then we do the replacement.  I think something like this would allow
>>> you to have a lot more information available to make a direct
>>> apples-to-apples comparison at cache replacement time, as compared
>>> with what you have now.
>> yeah -- what you've done here is basically two things: 1. by mapping
>> static ranges you get to skip the sort (but not the scan) during the
>> replacement, and 2. by vastly expanding the sampling size you
>> eliminate thrashing via noise in the rolling sample.  This comes at a
>> significant memory cost and loss of some nimbleness in the cache (i'm
>> not completely sure more aggressive replacement is 'bad')  -- although
>> that could be mitigated by replacing at more frequent intervals.
> Well, that gets at an interesting question: how often SHOULD we
> replace cache pages?  My gut is that it's 10K-100K and your gut is
> that it's 100-1000, which is a hundred-fold difference.  Seems like we
> need some kind of emprical data to decide what actually works well in
> real life.

I think using a smaller 'n' for a sort based replacement is on solid
mathematical grounds.  Amortized replacement performance is (lg(n) +
(k/n)) * q where q is the miss rate and k is the page replacement
cost.  k/n is close to zero for n >= 100 so it's really lg(n) * q.
>From this we can deduce that since lg(10k)  is roughly double lg(100),
you'd  have to get double the hit rate to break even and I just don't
think that's going to happen -- it's pretty hard to even contrive
cases with miss rates > .05 (and from there replacement performance is

OTOH, your approach wants n to be larger because presumably the hit
rate would be better -- amortized replacement performance is k/n * q.
The value that gives the lowest q within reasonable memory constraints
is what you'd want.  hm.


In response to


pgsql-hackers by date

Next:From: Jeff DavisDate: 2011-06-30 22:19:09
Subject: Re: [COMMITTERS] pgsql: Make the visibility map crash-safe.
Previous:From: Devrim GÜNDÜZDate: 2011-06-30 20:00:03
Subject: Re: [COMMITTERS] pgsql: Enable CHECK constraints to be declared NOT VALID

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group