Re: Database Caching

From: Greg Copeland <greg(at)CopelandConsulting(dot)Net>
To: mlw <markw(at)mohawksoft(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Sabino Mullane <greg(at)turnstep(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Database Caching
Date: 2002-03-01 23:06:20
Message-ID: 3C80096C.8050302@copelandconsulting.net
Lists: pgsql-hackers

I'm sneaking out of my cave here.. ;)

mlw wrote:
> Tom Lane wrote:
>
>>mlw <markw(at)mohawksoft(dot)com> writes:
>>
>>>My previous company, www.dmn.com, has a music database system. We logged all
>>>the backed info; most of the queries were duplicated many times. This can be
>>>explained by multiple users interested in the same thing or the same user
>>>hitting "next page".
>>>
>>>If you could cache the "next page" or similar hit results, you could really
>>>increase throughput and capacity of a website.

Well, I'm going to assume that the records in question have been
completely cached in cases like this. So isn't the primary source of
improvement the query parse and plan generation? As Tom seems to think,
wouldn't it make more sense to optimize the parse/plan generation rather
than cache the result set? After all, if the plan can be pinned, how
much of a performance boost do you expect to get from returning a
cached result set versus processing a cached plan?

Seriously, I am curious what the expected return is. Is it still a
multiple, or just some minor percentage?

>>>
>>Sure, but the most appropriate place to do that sort of thing is in the
>>application (in this case, probably a cgi/php-ish layer). Only the
>>application can know what its requirements are. In the case you
>>describe, it'd be perfectly okay for a "stale" cache result to be
>>delivered that's a few minutes out of date. Maybe a few hours out of
>>date would be good enough too, or maybe not. But if we do this at the
>>database level then we have to make sure it won't break *any*
>>applications, and that means the most conservative validity assumptions.
>>(Thus all the angst about how to invalidate cache entries on-the-fly.)
>>
>>Likewise, the application has a much better handle than the database on
>>the issue of which query results are likely to be worth caching.
>>
>>I think that reports of "we sped up this application X times by caching
>>query results on the client side" are interesting, but they are not good
>>guides to what would happen if we tried to put a query-result cache into
>>the database.
>>
>
> I would like to respectfully differ with you here. If query results are cached
> in an ACID safe way, then many things could be improved.
>
> The problem with applications caching is that they do not have intimate
> knowledge of the database, and thus do not know when their cache is invalid. On
> top of that, many web sites have multiple web servers connected to a single
> database. The caching must sit between the web software and the DB. The logical
> place for caching is in the database.
>

But hybrid application cache designs can mostly, if not completely,
address this, and they also get some added benefits in many cases. If
you have a "cache" table which denotes the tables involved in the
cached results you care about, you can then update its state via
triggers or even exposed procedures to reflect whether the client-side
cache has been invalidated. This means that a client need only query
the cache table first to determine whether its cache is clean or
dirty. When it's dirty, it merely needs to run the query again.
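
For what it's worth, here's a rough sketch of what I mean. The table,
trigger, and function names are made up, and the exact syntax may vary
between versions; this is just to illustrate the idea, not a tested
implementation:

  -- one row per cached result set the clients care about
  CREATE TABLE cache_state (
      tag      text PRIMARY KEY,            -- identifies a cached result set
      is_dirty boolean NOT NULL DEFAULT false
  );

  -- any write to t1 marks the dependent result set dirty
  CREATE FUNCTION mark_t1_dirty() RETURNS trigger AS $$
  BEGIN
      UPDATE cache_state SET is_dirty = true WHERE tag = 't1_fubar';
      RETURN NULL;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER t1_cache_dirty
      AFTER INSERT OR UPDATE OR DELETE ON t1
      FOR EACH STATEMENT
      EXECUTE PROCEDURE mark_t1_dirty();

The client then does a cheap

  SELECT is_dirty FROM cache_state WHERE tag = 't1_fubar';

before deciding whether to reuse its local copy or re-run the real
query (and reset the flag).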

Let's also not forget that client-side caching can also yield
significant networking performance improvements over a result set that
is cached on the server. Why? Well, let's say a query has a result set
of 10,000 rows which is being cached on the server. A remote client
queries and fetches 10,000 results over the network. Even though the
result set is cached by the database, it is still being transferred
over the wire for each and every query. Now let's assume that 10 other
people perform this same query. That's 100,000 rows which get
transferred across the wire. With the client-side caching scheme, you
have 10,010 rows returned across the wire (the initial 10,000-row
result set plus ten single-row results which tell each client whether
its cache is clean or dirty).

Let's face it, in order for the cache to make sense, the same result set
needs to be used over and over again. In those situations, it would seem
that a strong client-side hybrid caching scheme wins in most real-world
cases.

I'd also like to toss out that I'd expect there would be a trade-off
somewhere between the data cache and the result-set cache. On systems
without infinite memory, where's the line of delineation? It seems that
at some point you may be limiting the size of the generalized cache to
make room for the cached result sets. If this happens, access to a
cached result set may be improved, but refreshing that result set may
be hindered, as might all other queries on the system.

> If we went even further, and cached multiple levels of query, i.e. the result
> of the sub-select within the whole query, then things like views and more
> complex queries could get an increase in performance.
>
> Take this query:
>
> select * from (select * from T1 where field = 'fubar') as Z right outer join
> (select alt from T2, (select * from T1 where field = 'fubar') as X where
> T2.key = X.key) as Y
> on T3.key = Y.key) on (Z.key = Y.alt) where Z.key = NULL;
>
>
> Forgive this query, it is probably completely wrong, the actual query it is
> intended to represent is quite a bit larger. The intention is to select a set
> of alternate values based on a set of initial values, but also eliminating any
> alternate values which may also be in the initial set. Anyway, we have to query
> "Select * from T1 where field = 'fubar'" twice.
>
> If that subselect could be cached, it could speed up the query a bit. Right now
> I use a temp table, which is a hassle.
>

It's funny you say that, because I was thinking that should server-side
result-set caching truly be desired, wouldn't the use of triggers, a
procedure for the client interface, and a temporary table be a
poor man's implementation yielding almost the same results? Though, I
must say I'm assuming that the queries will *be* the same and not
merely *nearly* the same. But in the web world, isn't this really the
situation we're trying to address? That is, "give me the front page"?
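
Something along these lines, say (again just a sketch with a made-up
cache table name, assuming the 'fubar' subselect on T1 is the hot
spot):

  -- materialize the repeated subselect once
  CREATE TABLE t1_fubar_cache AS
      SELECT * FROM t1 WHERE field = 'fubar';

  -- the big query then reads t1_fubar_cache in both places instead of
  -- re-running the subselect; when the trigger marks it dirty, the
  -- refresh is simply:
  DELETE FROM t1_fubar_cache;
  INSERT INTO t1_fubar_cache SELECT * FROM t1 WHERE field = 'fubar';

Wrap the refresh-and-reset-flag logic in a procedure and the clients
never need to know how the cache is maintained.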

> Caching results can and do speed up duplicate queries, there can really be no
> argument about it. The argument is about the usefulness of the feature and the
> cost of implementing it. If maintaining the cache costs more than the benefit
> of having it, obviously it is a loser. If implementing it takes up the
> biological CPU cycles of the development team that would be spent doing more
> important things, then it is also a loser. If however, it is relatively "easy"
> (hehe) to do, and doesn't affect performance greatly, is there any harm in
> doing so?
>

Are any of the ideas that I put forth a viable substitute?

Greg
