Re: hint bit cache v6

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: hint bit cache v6
Date: 2011-06-29 20:18:01
Message-ID: BANLkTinsy2Xnpf7hvdKSRBOcPU=JYxfztQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 29, 2011 at 3:16 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> I think it's a fair point to ask how often thrashing cases will truly
> come up where you don't have some other significant cost like i/o.
> Even when you do thrash, you are just falling back on stock postgres
> behaviors minus the maintenance costs.

Agree.

> With the algorithm as coded, to fully flush the cache you'd have to
> find a series of *unhinted* tuples distributed across minimum of four
> 64k wide page ranges, with the number of tuples being over the 5%
> noise floor.  Furthermore, to eject a cache page you have to have more
> hits than a page already in the cache got -- hits are tracked on the
existing pages, and the candidate pages are effectively ranked
alongside the existing pages based on hits, with the best four
picked.  Also, is it
> reasonable to expect the cache to help in these types of cases
> anyways?

*scratches head*

Well, surely you must need to reset the counters for the pages you've
chosen to include in the cache at some point. If you don't, then
there's a strong likelihood that after some period of time the pages
you have in there will become so firmly nailed into the cache that
they can never be evicted.

To be clear, I don't really think it matters how sensitive the cache
is to a *complete* flush. The question I want to ask is: how much
does it take to knock ONE page out of cache? And what are the chances
of that happening too frequently? It seems to me that if a run of 100
tuples with the same previously-unseen XID is enough to knock over the
applecart, then that's not a real high bar - you could easily hit that
limit on a single page. And if that isn't enough, then I don't
understand the algorithm.

> If your concern is the cost of replacement, then you can
> micro-optimize (a hand rolled qsort alone should squeeze out quite a
> bit of fat) or experiment with a new algorithm as you suggested.
> However i should note that at least for pgbench fsync=off oltp tests
> (the only way I could think of to exercise cache replacement), it
> (replacement costs) did not show up on the radar.

It might be useful, just for debugging purposes, to add an elog(LOG)
in there that triggers whenever a cache flush occurs. And then play
with different workloads and try to make it happen as often as you
can. One idea I had was to hack the XID generator so that it only
returns every (say) 200th XID, instead of consecutive ones. That
might make it easier to identify the sorts of workloads that are prone
to causing problems, and then we could further analyze those workloads
and see whether they are a problem in real life, and/or what can be
done to minimize the impact.
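To illustrate the second idea, here is a minimal sketch of a strided test
XID generator. Everything here is hypothetical: the names are made up for
illustration and this is not PostgreSQL's real GetNewTransactionId(), just
a stand-in showing the effect of handing out every 200th XID.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stride; 200 is the example value from the discussion. */
#define XID_STRIDE 200

static uint32_t next_test_xid = 3;      /* 3 = FirstNormalTransactionId */

/*
 * Hand out every 200th XID, burning through the XID space (and hence
 * the 64K-wide cache pages) about 200x faster than a normal workload.
 */
static uint32_t
get_test_xid(void)
{
    uint32_t xid = next_test_xid;

    next_test_xid += XID_STRIDE;
    return xid;
}
```

With a stride of 200, only about 328 transactions fit in one 64K-wide
cache page instead of 65536, so the same pgbench run would exert roughly
200x more cache-replacement pressure.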

> OTOH, If your main concern is about losing data out of the cache via
> page swap, increasing the number of cache pages (or use smaller ones)
might work, but this incrementally makes testing the cache more
> expensive.

Yeah, I am inclined to think that going very far in that direction is
going to be a big pile of fail. TBH, I'm somewhat surprised that you
managed to get what you already have to work without a performance
regression. That code path is extremely hot, and injecting more
complexity seems like it ought to be a loser. For purposes of this
review, I'm taking it as given that you've tested this carefully, but
I'll admit to lingering skepticism.

> I did briefly experiment with trying to bootstrap incoming cache pages
> with data gathered out of the clog -- but it didn't help much and was
a lot more trouble than it was worth.

I believe it.

> One possible enhancement would be to try and bias pages with more data
> in them so that it's more difficult to get them out -- or even
> maintain separate pools with ejection priority.  I'm not in a hurry
> and suggestions are welcome.

I think the basic problem is that the cache pages are so large. It's
hard to make them smaller because that increases the cost of accessing
the cache, as you already noted, but at the same time, making an
eviction decision on a cache that holds 64K entries based on 100 data
points seems like waving around a pretty large-caliber weapon in a
fairly uncontrolled fashion. Maybe it works OK 90% of the time, but
it's hard to avoid the niggling fear that someone might accidentally
get shot.

Just for the sake of argument, let's suppose we made an array with 64K
elements, each one representing one 64K chunk of the XID space. Each
element is a 4-byte unsigned integer, so the array takes up 256kB of
memory... probably not good, but I'm just thinking out loud here.
Every time we're asked about an XID, we bump the corresponding
counter. Periodically (say, every 64K probes) we zip through all the
counters and consider performing a cache replacement. We look at the
not-already-cached page that has the highest counter value and compare
it to the counter value for the cached pages. If it is higher and the
difference exceeds some safety margin (to provide hysteresis and avoid
thrashing),
then we do the replacement. I think something like this would allow
you to have a lot more information available to make a direct
apples-to-apples comparison at cache replacement time, as compared
with what you have now.
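A rough sketch of what I have in mind, purely to make the shape of the
idea concrete; all names and constants are illustrative, not from your
patch:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define XID_PAGE_BITS   16                      /* 64K XIDs per cache page */
#define NUM_XID_PAGES   (1 << XID_PAGE_BITS)    /* 64K pages -> 256kB array */
#define CACHE_PAGES     4
#define PROBE_INTERVAL  65536
#define SAFETY_MARGIN   100                     /* hysteresis guard */

static uint32_t probe_counts[NUM_XID_PAGES];
static uint32_t cached_page[CACHE_PAGES] = {0, 1, 2, 3};
static uint64_t probes_since_sweep = 0;

static int
page_is_cached(uint32_t page)
{
    for (int i = 0; i < CACHE_PAGES; i++)
        if (cached_page[i] == page)
            return 1;
    return 0;
}

/* Bump the counter for the page containing this XID; sweep periodically. */
static void
note_xid_probe(uint32_t xid)
{
    probe_counts[xid >> XID_PAGE_BITS]++;

    if (++probes_since_sweep >= PROBE_INTERVAL)
    {
        /* Find the hottest page not already in the cache... */
        uint32_t best_page = 0, best_count = 0;

        for (uint32_t p = 0; p < NUM_XID_PAGES; p++)
        {
            if (probe_counts[p] > best_count && !page_is_cached(p))
            {
                best_count = probe_counts[p];
                best_page = p;
            }
        }

        /* ...and the coldest page currently in the cache. */
        int victim = 0;

        for (int i = 1; i < CACHE_PAGES; i++)
            if (probe_counts[cached_page[i]] <
                probe_counts[cached_page[victim]])
                victim = i;

        /* Replace only if the challenger wins by a clear margin. */
        if (best_count > probe_counts[cached_page[victim]] + SAFETY_MARGIN)
            cached_page[victim] = best_page;

        memset(probe_counts, 0, sizeof(probe_counts));
        probes_since_sweep = 0;
    }
}
```

The point is that by sweep time every page, cached or not, has a full
interval's worth of hits behind it, so the eviction decision compares
like with like.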

Now, the big problem here is that a 256kB array is probably too big,
both because we don't want to blow out the lower-level CPU caches and
because every time we consider performing a cache replacement we're
going to want to zero the whole thing, and that has to be fast.
But we probably don't really need the array to cover the entire XID
space. vacuum_freeze_min_age defaults to 50 million transactions, and
vacuum_freeze_table_age defaults to 150 million transactions. If we
only concern ourselves with the last, say, 256 million transactions,
which ought to be way more than we will need for almost any practical
purpose, then that's just 1/16th of the total XID space. If we only
worry about the last 64 million transactions, then that's just 1/64th
of the total XID space. Possibly we could go take an even smaller
slice than that. At any rate now you're talking about a much smaller
array, although sadly with a bit more bookkeeping to keep track of
which portion of the XID space we're currently tracking. And maybe
you could also use 2-byte counters rather than 4-byte counters, if
you're careful to set it up so they can't overflow.
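The windowed variant might look something like this; again the names,
the 64M window, and the sliding scheme are all assumptions for the sake
of illustration:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define XID_PAGE_BITS  16
#define WINDOW_XIDS    (64u * 1024 * 1024)              /* last 64M XIDs */
#define WINDOW_PAGES   (WINDOW_XIDS >> XID_PAGE_BITS)   /* 1024 slots -> 2kB */

static uint16_t win_counts[WINDOW_PAGES];
static uint32_t window_base = 0;        /* lowest XID covered; page-aligned */

/* Count a probe; returns 0 if the XID falls outside the tracked window. */
static int
note_xid_windowed(uint32_t xid)
{
    uint32_t slot;

    if (xid < window_base || xid - window_base >= WINDOW_XIDS)
        return 0;

    slot = (xid - window_base) >> XID_PAGE_BITS;
    if (win_counts[slot] < UINT16_MAX)  /* saturate rather than overflow */
        win_counts[slot]++;
    return 1;
}

/* Slide the window forward, discarding counters for the oldest pages. */
static void
advance_window(uint32_t new_base)
{
    uint32_t shift = (new_base - window_base) >> XID_PAGE_BITS;

    if (shift >= WINDOW_PAGES)
        memset(win_counts, 0, sizeof(win_counts));
    else
    {
        memmove(win_counts, win_counts + shift,
                (WINDOW_PAGES - shift) * sizeof(uint16_t));
        memset(win_counts + (WINDOW_PAGES - shift), 0,
               shift * sizeof(uint16_t));
    }
    window_base = new_base;
}
```

The 16-bit saturating counters keep the whole structure at 2kB for a 64M
window, which fits comfortably in L1; the bookkeeping cost is the
occasional memmove when the window slides.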

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
