Re: hint bit cache v6

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: hint bit cache v6
Date: 2011-07-05 16:44:42
Message-ID: CAHyXU0ysTUJtZPKRqQctShpysn0QYyhUFDe1b+jcMVt58MqZ0w@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 30, 2011 at 3:42 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Thu, Jun 30, 2011 at 11:44 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Thu, Jun 30, 2011 at 11:18 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>>>> I think the basic problem is that the cache pages are so large.  It's
>>>> hard to make them smaller because that increases the cost of accessing
>>>> the cache, as you already noted, but at the same time, making an
>>>> eviction decision on a cache that holds 64K entries based on 100 data
>>>> points seems like waving around a pretty large-caliber weapon in a
>>>> fairly uncontrolled fashion.  Maybe it works OK 90% of the time, but
>>>> it's hard to avoid the niggling fear that someone might accidentally
>>>> get shot.
>>>
>>> Right...it's essentially a rolling statistical sampling of transaction
>>> IDs based on past accesses.  Going from, say, 100 to 1000 will probably
>>> drop your error margin quite a bit, but I really wonder how much
>>> benefit you can get from going past that.
>>
>> An interesting idea might be to forcibly cause a replacement every 100
>> tuples (or every 1000 tuples) and see if that causes a noticeable
>> performance regression.  If it doesn't, that's a good data point.
>
> To test this I disabled the cache check on top of
> HeapTupleSatisfiesMVCC and forced a cache entry in place of it with a
> bogus xid (so every tuple scanned was treated as a 'miss').  Scanning 1
> million records in memory over and over got a few percent slower --
> converging on about 280ms instead of 270ms.  This is with the miss
> array bumped to 1000 entries:
>
> oprofile said:
> regular hinted tuple case:
> 120      10.2041  advance_aggregates
> 107       9.0986  heapgettup_pagemode
> 77        6.5476  ExecProject
> 74        6.2925  heapgetpage
> 72        6.1224  ExecProcNode
> 72        6.1224  ExecScan
> 69        5.8673  advance_transition_function
> 66        5.6122  heap_getnext
> 58        4.9320  HeapTupleSatisfiesMVCC
>
> hinted tuple with pathological cache entry:
> 111       8.9588  advance_aggregates
> 109       8.7974  heapgettup_pagemode
> 85        6.8604  ExecProject
> 81        6.5375  heapgetpage
> 77        6.2147  ExecScan
> 70        5.6497  ExecStoreTuple
> 70        5.6497  HeapTupleSatisfiesMVCC
>
> this was a fairly short profiling run but the numbers are consistent.
> the replacement is cheap...sorting 1000 ints doesn't take all that
> long, and 100 is even faster.
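
To illustrate what I mean by the sort-based replacement, here is a
rough sketch of the idea.  This is not the actual patch code; the
names, the miss array size, and the 64K page granularity are just
illustrative assumptions:

#include <stdint.h>
#include <stdlib.h>

#define MISS_ARRAY_SIZE 1000        /* 100 in the patch; 1000 in the test above */
#define XID_PAGE_SIZE   65536       /* xids covered by one cache page */

static uint32_t miss_xids[MISS_ARRAY_SIZE];     /* xids that missed the cache */
static int      miss_count;                     /* filled by the satisfies routine */

static int
xid_cmp(const void *a, const void *b)
{
    uint32_t    xa = *(const uint32_t *) a;
    uint32_t    xb = *(const uint32_t *) b;

    return (xa > xb) - (xa < xb);
}

/*
 * Called when the miss array is full: sort the misses and return the
 * xid page that accounted for most of them, i.e. the best candidate
 * to load into the cache next.
 */
static uint32_t
choose_replacement_page(void)
{
    uint32_t    best_page,
                cur_page;
    int         best_run = 0,
                run = 0;

    qsort(miss_xids, miss_count, sizeof(uint32_t), xid_cmp);

    best_page = cur_page = miss_xids[0] / XID_PAGE_SIZE;
    for (int i = 0; i < miss_count; i++)
    {
        uint32_t    page = miss_xids[i] / XID_PAGE_SIZE;

        if (page == cur_page)
            run++;
        else
        {
            if (run > best_run)
            {
                best_run = run;
                best_page = cur_page;
            }
            cur_page = page;
            run = 1;
        }
    }
    if (run > best_run)
        best_page = cur_page;

    miss_count = 0;             /* start a fresh sample */
    return best_page;
}

The qsort is where the lg(n) factor I get to further down comes from;
the rest is a single linear pass over the sorted array.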
>
>> I think the sour spot for this whole idea is likely to be the case
>> where you have a workload that is 90% read and 10% write, or something
>> like that.  On an all-read workload (like pgbench -S), you're
>> hopefully going to converge to a state where all the hint-bits are
>> set, once VACUUM kicks in.  And on an all-write workload I think that
>> you might lose the effect in the noise.  Not sure if we have an easy
>> way to set up such a workload with pgbench, though.
>
> it's trivial to test with pgbench -- just use the random number
> facility to generate an id for some table and update where random() >
> .9.  However, this does not generate nearly enough 'misses' to
> exercise cache replacement in any meaningful way.  My workstation vm
> can punch out about 5k transactions/sec.  With only 10% getting a new
> xid and writing back a tuple to the table, only 500 xids are getting
> generated per second.  At that rate it takes quite a while to burn
> through the 256k transactions the cache ranges over.  Also, this case
> is not bound by scan performance anyway, which makes profiling it a
> little silly -- HeapTupleSatisfiesMVCC is simply not as big a
> bottleneck in OLTP workloads.  Scan-heavy loads are where this
> matters, and scan-heavy loads naturally tend to generate fewer xids
> per tuple scanned.
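
(For the record, "quite a while" above works out to: 5k transactions/sec
with 10% of them writing is about 500 new xids/sec, so one pass through
the 256k xids the cache ranges over takes roughly 256k / 500, i.e.
around 500 seconds.)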
>
>>>> Just for the sake of argument, let's suppose we made an array with 64K
>>>> elements, each one representing one 64K chunk of the XID space.  Each
>>>> element is a 4-byte unsigned integer, so the array takes up 256kB of
>>>> memory... probably not good, but I'm just thinking out loud here.
>>>> Every time we're asked about an XID, we bump the corresponding
>>>> counter.  Periodically (say, every 64K probes) we zip through all the
>>>> counters and consider performing a cache replacement.  We look at the
>>>> not-already-cached page that has the highest counter value and compare
>>>> it to the counter value for the cached pages.  If it is higher and the
>>>> difference exceeds some safety margin (to protect against hysteresis),
>>>> then we do the replacement.  I think something like this would allow
>>>> you to have a lot more information available to make a direct
>>>> apples-to-apples comparison at cache replacement time, as compared
>>>> with what you have now.
>>>
>>> yeah -- what you've done here is basically two things: 1. by mapping
>>> static ranges you get to skip the sort (but not the scan) during the
>>> replacement, and 2. by vastly expanding the sampling size you
>>> eliminate thrashing via noise in the rolling sample.  This comes at a
> significant memory cost and loss of some nimbleness in the cache (I'm
> not completely sure more aggressive replacement is 'bad') -- although
> that could be mitigated by replacing at more frequent intervals.
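
To make sure I'm reading the proposal right, here is a sketch of how I
understand it.  All of the names, the number of cached pages, and the
safety margin value are my own assumptions, not from any patch:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define XID_PAGE_SIZE    65536      /* xids per 64K chunk of xid space */
#define NUM_COUNTERS     65536      /* one counter per chunk: 256kB of memory */
#define PROBES_PER_SWEEP 65536      /* how often to consider a replacement */
#define CACHED_PAGES     4          /* hypothetical number of cached pages */
#define SAFETY_MARGIN    1000       /* hysteresis guard, value made up */

static uint32_t probe_counts[NUM_COUNTERS];
static uint32_t cached_page[CACHED_PAGES];  /* xid page held in each cache slot */
static uint32_t probes_since_sweep;

static bool
page_is_cached(uint32_t page)
{
    for (int i = 0; i < CACHED_PAGES; i++)
        if (cached_page[i] == page)
            return true;
    return false;
}

/* Called for every xid the satisfies routines ask about. */
static void
count_probe(uint32_t xid)
{
    probe_counts[xid / XID_PAGE_SIZE]++;

    if (++probes_since_sweep < PROBES_PER_SWEEP)
        return;

    /* find the hottest not-already-cached page... */
    uint32_t    best_page = 0,
                best_count = 0;

    for (uint32_t p = 0; p < NUM_COUNTERS; p++)
    {
        if (page_is_cached(p))
            continue;
        if (probe_counts[p] > best_count)
        {
            best_count = probe_counts[p];
            best_page = p;
        }
    }

    /* ...and the coldest cached page */
    int         victim = 0;

    for (int i = 1; i < CACHED_PAGES; i++)
        if (probe_counts[cached_page[i]] < probe_counts[cached_page[victim]])
            victim = i;

    /* replace only if it's a clear win over the coldest cached page */
    if (best_count > probe_counts[cached_page[victim]] + SAFETY_MARGIN)
        cached_page[victim] = best_page;    /* (re)loading the page goes here */

    memset(probe_counts, 0, sizeof(probe_counts));
    probes_since_sweep = 0;
}

That keeps the per-probe cost at a single array increment, with the
counter scan and any page reload amortized over 64K probes -- the k/n
term I get to below.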
>>
>> Well, that gets at an interesting question: how often SHOULD we
>> replace cache pages?  My gut is that it's 10K-100K and your gut is
>> that it's 100-1000, which is a hundred-fold difference.  Seems like we
>> need some kind of empirical data to decide what actually works well in
>> real life.
>
> I think using a smaller 'n' for a sort-based replacement is on solid
> mathematical grounds.  Amortized replacement cost per lookup is
> (lg(n) + (k/n)) * q, where q is the miss rate and k is the page
> replacement cost.  k/n is close to zero for n >= 100, so it's really
> lg(n) * q.  From this we can deduce that since lg(10k) is roughly
> double lg(100), the miss rate would have to be cut in half just to
> break even, and I just don't think that's going to happen -- it's
> pretty hard to even contrive cases with miss rates > .05 (and at miss
> rates that low, replacement performance is moot).
>
> OTOH, your approach wants n to be larger because presumably the hit
> rate would be better -- amortized replacement cost there is (k/n) * q.
> The value of n that gives the lowest q within reasonable memory
> constraints is what you'd want.  hm.
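
To put toy numbers on those two formulas (the k and q values here are
completely made up, just to show the shape of the comparison):

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double      k = 64.0;       /* assumed page replacement cost, arbitrary units */
    double      q = 0.05;       /* assumed miss rate */

    /* sort-based replacement: (lg(n) + k/n) * q per lookup */
    for (double n = 100; n <= 10000; n *= 10)
        printf("sort-based,    n = %6.0f: %.4f\n", n, (log2(n) + k / n) * q);

    /* counter-based replacement: (k/n) * q per lookup */
    for (double n = 100; n <= 100000; n *= 10)
        printf("counter-based, n = %6.0f: %.4f\n", n, (k / n) * q);

    return 0;
}

(Compile with -lm.)  The sort-based cost grows with lg(n), which is why
a small n looks right there; the counter-based cost only shrinks as n
grows, which is why that approach wants n as large as memory allows.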

I'm going to experiment with a new implementation incorporating some
of Robert's ideas.  Even if you spend a (very) few more cycles in the
satisfies routines getting and setting the cache, it adds peace of
mind when you know your replacement is cheap and degenerate cases
won't be badly penalized -- especially when they are so difficult to
test for.  The chance of this getting finished before the end of the
current commitfest is zero, so I'm marking the patch 'returned with
feedback'.

merlin
