Re: hint bit cache v6

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: hint bit cache v6
Date: 2011-06-30 20:42:26
Message-ID: BANLkTiky_C9W_R+Vyu3+bebrmaKnwHTwQg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 30, 2011 at 11:44 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Jun 30, 2011 at 11:18 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>>> I think the basic problem is that the cache pages are so large.  It's
>>> hard to make them smaller because that increases the cost of accessing
>>> the cache, as you already noted, but at the same time, making an
>>> eviction decision on a cache that holds 64K entries based on 100 data
>>> points seems like waving around a pretty large-caliber weapon in a
>>> fairly uncontrolled fashion.  Maybe it works OK 90% of the time, but
>>> it's hard to avoid the niggling fear that someone might accidentally
>>> get shot.
>>
>> Right...it's essentially a rolling statistical sampling of transaction
>> IDs based on past accesses.  Going from, say, 100 to 1000 will probably
>> drop your error margin quite a bit, but I really wonder how much benefit
>> you can get from going past that.
>
> An interesting idea might be to forcibly cause a replacement every 100
> tuples (or every 1000 tuples) and see if that causes a noticeable
> performance regression.  If it doesn't, that's a good data point.

To test this I disabled the cache check on top of
HeapTupleSatisfiesMVCC and forced a cache entry in place of it with a
bogus xid (so every tuple scanned was treated as a 'miss').  Repeated
scans of 1 million in-memory records got a few percent slower --
converging on about 280ms instead of 270ms.  This is with the miss
array bumped to 1000 entries:

oprofile said:
regular hinted tuple case:
120 10.2041 advance_aggregates
107 9.0986 heapgettup_pagemode
77 6.5476 ExecProject
74 6.2925 heapgetpage
72 6.1224 ExecProcNode
72 6.1224 ExecScan
69 5.8673 advance_transition_function
66 5.6122 heap_getnext
58 4.9320 HeapTupleSatisfiesMVCC

hinted tuple with pathological cache entry:
111 8.9588 advance_aggregates
109 8.7974 heapgettup_pagemode
85 6.8604 ExecProject
81 6.5375 heapgetpage
77 6.2147 ExecScan
70 5.6497 ExecStoreTuple
70 5.6497 HeapTupleSatisfiesMVCC

This was a fairly short profiling run, but the numbers are pretty
consistent.  The replacement itself is cheap...sorting 1000 ints
doesn't take all that long, and 100 is even faster.
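
Just so we're talking about the same thing, here is roughly the shape
of the work I mean by 'replacement'.  This is a standalone sketch, not
the patch code; the 1000-entry miss array and the 64K-xid page size
are the numbers from this thread, and everything else (names, the fake
miss stream) is made up for illustration:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MISS_ARRAY_SIZE 1000      /* the 'n' under discussion */
    #define XIDS_PER_PAGE   65536     /* one cache page covers 64K xids */

    static int
    xid_cmp(const void *a, const void *b)
    {
        uint32_t x = *(const uint32_t *) a;
        uint32_t y = *(const uint32_t *) b;

        return (x > y) - (x < y);
    }

    /*
     * Sort the recorded misses and return the 64K-xid page that
     * accounts for the most of them, i.e. the candidate for cache
     * replacement.
     */
    static uint32_t
    densest_miss_page(uint32_t *misses, int n)
    {
        uint32_t best_page = 0;
        int      best_count = 0;
        int      i = 0;

        qsort(misses, n, sizeof(uint32_t), xid_cmp);

        while (i < n)
        {
            uint32_t page = misses[i] / XIDS_PER_PAGE;
            int      j = i;

            while (j < n && misses[j] / XIDS_PER_PAGE == page)
                j++;
            if (j - i > best_count)
            {
                best_count = j - i;
                best_page = page;
            }
            i = j;
        }
        return best_page;
    }

    int
    main(void)
    {
        uint32_t misses[MISS_ARRAY_SIZE];

        /* fake a miss stream clustered around one xid neighborhood */
        for (int i = 0; i < MISS_ARRAY_SIZE; i++)
            misses[i] = 3000000 + (uint32_t) (rand() % 200000);

        printf("densest page: %u\n",
               densest_miss_page(misses, MISS_ARRAY_SIZE));
        return 0;
    }

The point is just that the qsort plus one linear pass over ~1000
uint32s is tiny relative to the millions of tuple visibility checks
that happen between replacements.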

> I think the sour spot for this whole idea is likely to be the case
> where you have a workload that is 90% read and 10% write, or something
> like that.  On an all-read workload (like pgbench -S), you're
> hopefully going to converge to a state where all the hint-bits are
> set, once VACUUM kicks in.  And on an all-write workload I think that
> you might lose the effect in the noise.  Not sure if we have an easy
> way to set up such a workload with pgbench, though.

It's trivial to test with pgbench -- just use the random number
facility to generate an id for some table and update where random() >
.9.  However, this does not generate nearly enough 'misses' to
exercise cache replacement in any meaningful way.  My workstation vm
can punch out about 5k transactions/sec.  With only 10% of
transactions getting a new xid and writing a tuple back to the table,
only 500 xids are generated per second.  At that rate it takes quite a
while to burn through the 256k transactions the cache ranges over.
Also, this case is not bound by scan performance anyway, which makes
profiling it a little silly -- HeapTupleSatisfiesMVCC is simply not as
big a bottleneck in OLTP workloads.  Scan-heavy loads are where this
matters, and scan-heavy loads naturally tend to generate fewer xids
per tuple scanned.
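
(For the record: 256k xids at 500 xids/sec is roughly 512 seconds, so
call it 8-9 minutes just to cycle once through the xid range the cache
covers.)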

>>> Just for the sake of argument, let's suppose we made an array with 64K
>>> elements, each one representing one 64K chunk of the XID space.  Each
>>> element is a 4-byte unsigned integer, so the array takes up 256kB of
>>> memory... probably not good, but I'm just thinking out loud here.
>>> Every time we're asked about an XID, we bump the corresponding
>>> counter.  Periodically (say, every 64K probes) we zip through all the
>>> counters and consider performing a cache replacement.  We look at the
>>> not-already-cached page that has the highest counter value and compare
>>> it to the counter value for the cached pages.  If it is higher and the
>>> difference exceeds some safety margin (to protect against hysteresis),
>>> then we do the replacement.  I think something like this would allow
>>> you to have a lot more information available to make a direct
>>> apples-to-apples comparison at cache replacement time, as compared
>>> with what you have now.
>>
>> yeah -- what you've done here is basically two things: 1. by mapping
>> static ranges you get to skip the sort (but not the scan) during the
>> replacement, and 2. by vastly expanding the sampling size you
>> eliminate thrashing via noise in the rolling sample.  This comes at a
>> significant memory cost and loss of some nimbleness in the cache (i'm
>> not completely sure more aggressive replacement is 'bad')  -- although
>> that could be mitigated by replacing at more frequent intervals.
>
> Well, that gets at an interesting question: how often SHOULD we
> replace cache pages?  My gut is that it's 10K-100K and your gut is
> that it's 100-1000, which is a hundred-fold difference.  Seems like we
> need some kind of empirical data to decide what actually works well in
> real life.

I think using a smaller 'n' for a sort-based replacement is on solid
mathematical grounds.  Amortized replacement cost is (lg(n) +
(k/n)) * q, where q is the miss rate and k is the page replacement
cost.  k/n is close to zero for n >= 100, so it's really lg(n) * q.
From this we can deduce that since lg(10k) is roughly double lg(100),
the larger sample would have to roughly halve the miss rate just to
break even, and I just don't think that's going to happen -- it's
pretty hard to even contrive cases with miss rates > .05 (and from
there replacement performance is moot).
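
To put rough numbers on that (just illustrative; take q = .05 and
ignore the k/n term):

    n = 100:    lg(100) * .05   ~=  6.6 * .05  ~= 0.33 units of work per tuple
    n = 10000:  lg(10000) * .05 ~= 13.3 * .05  ~= 0.66 units of work per tuple

so the bigger sample has to roughly halve q before it even pays for
its own bookkeeping.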

OTOH, your approach wants n to be larger because presumably the hit
rate would be better -- amortized replacement performance is k/n * q.
The n that gives the lowest q within reasonable memory constraints is
what you'd want. hm.
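
For what it's worth, here's your proposal as I understand it, written
out as a standalone sketch (not working patch code -- CACHE_PAGES,
SAFETY_MARGIN, and all of the names are invented for illustration;
the 64K chunks, 256kB counter array, and every-64K-probes check are
from your description):

    #include <stdbool.h>
    #include <stdint.h>

    #define XID_CHUNKS      65536   /* 2^32 xids / 64K xids per chunk */
    #define PROBE_INTERVAL  65536   /* reconsider replacement every 64K probes */
    #define CACHE_PAGES     4       /* 4 x 64K = the 256K xids cached today */
    #define SAFETY_MARGIN   1024    /* hysteresis guard; value is arbitrary */

    static uint32_t probe_counts[XID_CHUNKS];   /* the 256kB counter array */
    static uint32_t cached_chunk[CACHE_PAGES];  /* which chunks are cached */
    static uint32_t probes_since_check = 0;

    static bool
    chunk_is_cached(uint32_t chunk)
    {
        for (int i = 0; i < CACHE_PAGES; i++)
            if (cached_chunk[i] == chunk)
                return true;
        return false;
    }

    static void
    consider_replacement(void)
    {
        uint32_t best_chunk = 0;
        uint32_t best_count = 0;
        int      victim = 0;

        /* hottest chunk that is not already cached */
        for (uint32_t c = 0; c < XID_CHUNKS; c++)
        {
            if (!chunk_is_cached(c) && probe_counts[c] > best_count)
            {
                best_count = probe_counts[c];
                best_chunk = c;
            }
        }

        /* coldest cached chunk is the eviction candidate */
        for (int i = 1; i < CACHE_PAGES; i++)
        {
            if (probe_counts[cached_chunk[i]] < probe_counts[cached_chunk[victim]])
                victim = i;
        }

        /* replace only if the challenger wins by a clear margin */
        if (best_count > probe_counts[cached_chunk[victim]] + SAFETY_MARGIN)
            cached_chunk[victim] = best_chunk;  /* real code would also build
                                                 * the new page's hint state */
    }

    /* called for every xid the visibility check asks about */
    static void
    note_probe(uint32_t xid)
    {
        probe_counts[xid / 65536]++;
        if (++probes_since_check >= PROBE_INTERVAL)
        {
            consider_replacement();
            probes_since_check = 0;
            /* probably also want to decay or zero the counters here */
        }
    }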

merlin
