Re: hint bit cache v6

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: hint bit cache v6
Date: 2011-06-30 15:18:23
Message-ID: BANLkTikdRBuS3=ZRrgZ+YQjO6Z9348Lp7w@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 29, 2011 at 3:18 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> With the algorithm as coded, to fully flush the cache you'd have to
>> find a series of *unhinted* tuples distributed across minimum of four
>> 64k wide page ranges, with the number of tuples being over the 5%
>> noise floor.  Furthermore, to eject a cache page you have to have more
>> hits than a page already in the cache got -- hits are tracked on the
>> existing pages, and the candidate pages are effectively ranked alongside
>> the existing pages based on hits, with the best four picked.  Also, is it
>> reasonable to expect the cache to help in these types of cases
>> anyways?
>
> *scratches head*
>
> Well, surely you must need to reset the counters for the pages you've
> chosen to include in the cache at some point.  If you don't, then
> there's a strong likelihood that after some period of time the pages
> you have in there will become so firmly nailed into the cache that
> they can never be evicted.

yup -- hit counts are reset on replacement.
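For concreteness, here's roughly what that bookkeeping looks like -- the
struct, the names, and the per-page status layout below are made up for
illustration, not lifted from the patch:

#include <stdint.h>
#include <string.h>

#define NUM_CACHE_PAGES  4           /* four cached 64k-wide xid ranges */
#define XIDS_PER_PAGE    65536

typedef struct HintCachePage
{
    uint32_t    base_xid;            /* first xid covered by this page */
    uint32_t    hits;                /* hits since the last replacement */
    uint8_t     status[XIDS_PER_PAGE / 4];   /* e.g. 2 bits of state per xid */
} HintCachePage;

static HintCachePage cache[NUM_CACHE_PAGES];

/*
 * A candidate 64k range is ranked against the existing pages on hit
 * count; it only gets in if it beats the coldest cached page, and the
 * replaced slot starts over from zero so stale popularity can't pin a
 * page into the cache forever.
 */
static void
hint_cache_consider(uint32_t candidate_base, uint32_t candidate_hits)
{
    int         coldest = 0;
    int         i;

    for (i = 1; i < NUM_CACHE_PAGES; i++)
        if (cache[i].hits < cache[coldest].hits)
            coldest = i;

    if (candidate_hits > cache[coldest].hits)
    {
        cache[coldest].base_xid = candidate_base;
        cache[coldest].hits = 0;     /* hit counts are reset on replacement */
        memset(cache[coldest].status, 0, sizeof(cache[coldest].status));
    }
}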

> Yeah, I am inclined to think that going very far in that direction is
> going to be a big pile of fail.  TBH, I'm somewhat surprised that you
> managed to get what you already have to work without a performance
> regression.  That code path is extremely hot, and injecting more
> complexity seems like it ought to be a loser.  For purposes of this
> review, I'm taking it as given that you've tested this carefully, but
> I'll admit to lingering skepticism.

You are absolutely right to be skeptical. Here's the rub -- it's
incredibly difficult to contrive a set of circumstances where the
cache is aggressively replaced, and even if you can somehow coerce
that to happen, the underlying data is permanently altered so that it
can't happen again. Cheating by mucking around with the xid or
altering the visibility routines to force replacement isn't really
fair. AFAICT, unhinted tuples are basically always both recent and
grouped into a fairly narrow transaction range. Sooner or later, at
least for tables that matter, a vacuum is going to roll around and
smear the bits back into the table. My point is that even in the worst
case I could think of -- pgbench continually jamming records into the
table -- replacements tend to be quite rare. Let's assume, though, that
I'm wrong and the pathological case is likely to happen.

> I think the basic problem is that the cache pages are so large.  It's
> hard to make them smaller because that increases the cost of accessing
> the cache, as you already noted, but at the same time, making an
> eviction decision on a cache that holds 64K entries based on 100 data
> points seems like waving around a pretty large-caliber weapon in a
> fairly uncontrolled fashion.  Maybe it works OK 90% of the time, but
> it's hard to avoid the niggling fear that someone might accidentally
> get shot.

Right...it's essentially a rolling statistical sample of transaction
IDs based on past accesses. Going from, say, 100 to 1000 will probably
drop your error margin quite a bit, but I really wonder how much
benefit you can get from going past that.
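To put numbers on it: the "sample" is just a small circular buffer of
recently seen unhinted xids that gets bucketed by 64k range when it
wraps. Sizes and names here are illustrative only, not the patch's:

#include <stdint.h>

#define SAMPLE_SIZE 100              /* ~100 today; 1000 would cut the noise */

static uint32_t sample[SAMPLE_SIZE];
static int      sample_pos = 0;

static void
note_unhinted_xid(uint32_t xid)
{
    sample[sample_pos++] = xid;
    if (sample_pos == SAMPLE_SIZE)
    {
        sample_pos = 0;
        /*
         * Scan the sample, bucket hits by xid / 65536, and hand the
         * hottest ranges to the replacement logic.
         */
    }
}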

> Just for the sake of argument, let's suppose we made an array with 64K
> elements, each one representing one 64K chunk of the XID space.  Each
> element is a 4-byte unsigned integer, so the array takes up 256kB of
> memory... probably not good, but I'm just thinking out loud here.
> Every time we're asked about an XID, we bump the corresponding
> counter.  Periodically (say, every 64K probes) we zip through all the
> counters and consider performing a cache replacement.  We look at the
> not-already-cached page that has the highest counter value and compare
> it to the counter value for the cached pages.  If it is higher and the
> difference exceeds some safety margin (to protect against hysteresis),
> then we do the replacement.  I think something like this would allow
> you to have a lot more information available to make a direct
> apples-to-apples comparison at cache replacement time, as compared
> with what you have now.

yeah -- what you've done here is basically two things: 1. by mapping
static ranges you get to skip the sort (but not the scan) during the
replacement, and 2. by vastly expanding the sampling size you
eliminate thrashing via noise in the rolling sample. This comes at a
significant memory cost and a loss of some nimbleness in the cache (I'm
not completely sure more aggressive replacement is 'bad') -- although
that could be mitigated by replacing at more frequent intervals.
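If I'm reading your proposal right, it'd look roughly like this (the
helper functions and the safety margin value are placeholders I made
up, not anything from the patch):

#include <stdint.h>
#include <string.h>

#define NUM_CHUNKS        65536      /* one counter per 64k-wide xid chunk */
#define PROBES_PER_EPOCH  65536      /* consider replacement every 64k probes */
#define SAFETY_MARGIN     1000       /* arbitrary hysteresis threshold */

static uint32_t chunk_hits[NUM_CHUNKS];     /* 256kB of counters */
static uint32_t probes = 0;

/* assumed helpers, not real functions anywhere */
extern int      chunk_is_cached(uint32_t chunk);
extern uint32_t coldest_cached_chunk(void);
extern void     replace_cached_chunk(uint32_t victim, uint32_t newcomer);

static void
count_probe(uint32_t xid)
{
    uint32_t    best = 0;
    uint32_t    best_hits = 0;
    uint32_t    victim;
    uint32_t    c;

    chunk_hits[xid / 65536]++;
    if (++probes < PROBES_PER_EPOCH)
        return;

    /* end of sampling period: find the hottest uncached chunk */
    for (c = 0; c < NUM_CHUNKS; c++)
    {
        if (!chunk_is_cached(c) && chunk_hits[c] > best_hits)
        {
            best = c;
            best_hits = chunk_hits[c];
        }
    }

    victim = coldest_cached_chunk();
    if (best_hits > chunk_hits[victim] + SAFETY_MARGIN)
        replace_cached_chunk(victim, best);

    memset(chunk_hits, 0, sizeof(chunk_hits));      /* zero all 256kB */
    probes = 0;
}

That full scan (and memset) of 256kB every epoch is exactly the part
I'd worry about.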

> Now, the big problem here is that a 256kB array is probably too big,
> both because we don't want to blow out the lower-level CPU caches and
> also because every time we consider performing a cache replacement
> we're going to want to zero the whole thing, and that has to be fast.
> But we probably don't really need the array to cover the entire XID
> space.  vacuum_freeze_min_age defaults to 50 million transactions, and
> vacuum_freeze_table_age defaults to 150 million transactions.  If we
> only concern ourselves with the last, say, 256 million transactions,
> which ought to be way more than we will need for almost any practical
> purpose, then that's just 1/16th of the total XID space.  If we only
> worry about the last 64 million transactions, then that's just 1/64th
> of the total XID space.   Possibly we could go take an even smaller
> slice than that.  At any rate now you're talking about a much smaller
> array, although sadly with a bit more bookkeeping to keep track of
> which portion of the XID space we're currently tracking.   And maybe
> you could also use 2-byte counters rather than 4-byte counters, if
> you're careful to set it up so they can't overflow.

hm...there are a lot of routes to exploiting the fact that most of the
xid space is essentially 'dead' -- most tuples are hinted or frozen.
This *should* be exploited -- I doubt a brute-force mapping of the
entire space will beat what I've got, because even the scan starts to
hurt. You've got to cut it down to the interesting ranges.

Your range-counting strategy might work -- I think that doing it right,
so that you don't have to map the entire range, is going to require two
levels of ranges: you activate a very wide range first and only then
look at its 64k subranges (rough sketch below). That way you could
reduce memory consumption significantly without sorting during the
replacement.
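Something along these lines -- all of the names and widths below are
illustrative guesses (e.g. 64M-wide coarse ranges), not anything from
the patch:

#include <stdint.h>
#include <string.h>

/*
 * Hypothetical two-level variant: instead of one counter per 64k chunk
 * over the whole xid space, keep fine-grained counters only for the 64k
 * subranges inside a single currently "active" coarse range (64M xids
 * wide here).  Probes outside the active range just bump a per-coarse-
 * range tally; when some other coarse range clearly dominates, the fine
 * counters get re-aimed at it.
 */
#define COARSE_WIDTH     (64 * 1024 * 1024)      /* 64M xids per coarse range */
#define NUM_COARSE       64                      /* covers the full 2^32 space */
#define FINE_PER_COARSE  (COARSE_WIDTH / 65536)  /* 1024 subranges of 64k xids */

static uint32_t coarse_hits[NUM_COARSE];
static uint32_t fine_hits[FINE_PER_COARSE];
static uint32_t active_coarse = 0;   /* which coarse range we track finely */

static void
two_level_probe(uint32_t xid)
{
    uint32_t    coarse = xid / COARSE_WIDTH;

    coarse_hits[coarse]++;

    if (coarse == active_coarse)
        fine_hits[(xid % COARSE_WIDTH) / 65536]++;
    else if (coarse_hits[coarse] > 2 * coarse_hits[active_coarse])
    {
        /* another coarse range clearly dominates: retarget the fine counters */
        active_coarse = coarse;
        memset(coarse_hits, 0, sizeof(coarse_hits));
        memset(fine_hits, 0, sizeof(fine_hits));
    }
}

That keeps the fine counters at ~4kB instead of 256kB, and replacement
only ever has to scan the 1024 subrange counters of the active range.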

merlin
