Re: Caching of Queries

From: "Aaron Werman" <awerman2(at)hotmail(dot)com>
To: "Scott Kirkwood" <scottakirkwood(at)gmail(dot)com>, <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Caching of Queries
Date: 2004-09-27 16:43:38
Message-ID: BAY18-DAV6m5UxP9Fek0002fd87@hotmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

From: "Scott Kirkwood" <scottakirkwood(at)gmail(dot)com>

> On Mon, 27 Sep 2004 15:03:01 +1000, Neil Conway <neilc(at)samurai(dot)com> wrote:
> > I think the conclusion of past discussions about this feature is that
> > it's a bad idea. Last I checked, MySQL has to clear the *entire* query
> > cache when a single DML statement modifying the table in question is
> > issued. Not to mention that the feature is broken for non-deterministic
> > queries (like now(), ORDER BY random(), or nextval('some_seq'), and so
> > on). That makes the feature close to useless for a lot of situations,
> > albeit not every situation.

Only the cache of changed tables are cleared. MySQL sanely doesn't cache
statements with unstable results. The vast majority of statements are
stable. The feature is likely to dramatically improve performance of most
applications; ones with lots of queries are obvious, but even data
warehouses have lots of (expensive) repetitious queries against static data.

>
> I think it's important to demark three levels of possible caching:
> 1) Caching of the parsed query tree
> 2) Caching of the query execute plan
> 3) Caching of the query results
>
> I think caching the query results (3) is pretty dangerous and
> difficult to do correctly.

I think it's very hard to cache results on the client side without guidance
because it is expensive to notify the client of change events. A changing
table couldn't be cached on client side without a synchronous check to the
db - defeating the purpose.

Guidance should work, though - I also think an optional client configuration
table which specified static tables would work and the cost of a sparse XOR
hash of statements to find match candidate statements would be negligible.
The list of tables would be a contract that they won't change. The fact is
that there often are a lot of completely static tables in high volume
transaction systems, and the gain of SQUID style proxying could be an
enormous performance gain (effort, network overhead, latency, DB server cont
ext switching, ...) especially in web farm and multi tiered applications
(and middleware doing caching invests so many cycles to do so).

Caching results on the server would also dramatically improve performance of
high transaction rate applications, but less than at the client. The
algorithm of only caching small result sets for tables that haven't changed
recently is trivial, and the cost of first pass filtering of candidate
statements to use a cache result through sparse XOR hashes is low. The
statement/results cache would need to be invalidated when any referenced
table is changed. This option seems like a big win.

>
> Caching of the the execute plan (2) is not dangerous but may actually
> execute more slowly by caching a bad plan (i.e. a plan not suited to
> the current data)

This concern could be resolved by aging plans out of cache.

This concern relates to an idiosyncrasy of pg, that vacuum has such a
profound effect. Anyone who has designed very high transaction rate systems
appreciates DB2 static binding, where a plan is determined and stored in the
database, and precompiled code uses those plans - and is both stable and
free of plan cost. The fact is that at a high transaction rate, we often see
the query parse and optimization as the most expensive activity. The planner
design has to be "dumbed down" to reduce overhead (and even forced to geqo
choice).

The common development philosophy in pg is expecting explicit prepares and
executes against bind variables (relatively rare, but useful in high volume
situations), and otherwise (commonly) using explicit literals in statements.
The problem here is the prepare/execute only works in monolithic
applications, and the chance of reuse of SQL statements with literals is
much lower.

(On a blue sky note, I would love to see a planner that dynamically changed
search depth of execution paths, so it could exhaustively build best plans
at low usage times and be less sophisticated when the load was higher... or
better yet, try alternatively for very high transaction frequency plans
until it found the best one in practice! The identified correct plan would
be used subsequently.)

>
> Caching of the query tree (1) to me has very little downsides (except
> extra coding). But may not have a lot of win either, depending how
> much time/resources are required to parse the SQL and lookup the
> objects in the system tables (something I've never gotten a
> satisfactory answer about). Also, some of the query cache would have
> to be cleared when DDL statements are performed.

Parse cache is obviously easy - just store the parse tree with a hash and
the SQL string. This would only help some very specific types of transaction
mixes. The issue is why go through all this trouble without caching the
plan? The same issues exist in both - the cost of matching, the need to
invalidate if objects definitions change, but the win would be so much less.

>
> -Scott
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings
>

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Matt Clark 2004-09-27 17:20:48 Re: Caching of Queries
Previous Message Harald Fuchs 2004-09-27 16:15:35 Re: Caching of Queries