Re: Postgresql Caching

From: Harvell F <fharvell(at)file13(dot)info>
To: mark(at)mark(dot)mielke(dot)cc
Cc: Shane Ambler <pgsql(at)007Marketing(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Postgresql Caching
Date: 2006-10-16 01:58:35
Message-ID: 24C93EEC-0509-4EF5-90CD-99C1D836E56C@file13.info
Lists: pgsql-hackers


On 15 Oct 2006, at 19:55, mark(at)mark(dot)mielke(dot)cc wrote:

> On Mon, Oct 16, 2006 at 07:00:20AM +0930, Shane Ambler wrote:
>> mark(at)mark(dot)mielke(dot)cc wrote:
>>> As a thought experiment, I'm not seeing the benefit. I think if you
>>> could prove a benefit, then any proof you provided could be used to
>>> improve the already existing caching layers, and would apply equally
>>> to read-only or read-write pages. For example, why not be able to
>>> hint to PostgreSQL that a disk-based table should be considered a
>>> priority to keep in RAM. That way, PostgreSQL would avoid pushing
>>> pages from this table out.
>> If memcached (or pgmemcached implemented in triggers) can show a
>> speed improvement using ram based caching (even with network
>> overhead) of specific data then it stands to reason that this ram
>> based cache can be integrated into postgres with better integration
>> that will overcome the issues that pgmemcached has.
>
> ...
> I think the memcache people are thinking that the cost of PostgreSQL
> is about the disk. Although the disk plays a part, I'm pretty sure
> it's only a fraction. Not providing transaction guarantees, not
> providing an SQL level abstraction, and not having multiple processes
> or threads plays a much bigger part.
>

Forgive my intrusion and perhaps simplistic viewpoint; however,
improved caching would be of great benefit to me as a web developer.

I wholeheartedly agree that disk I/O is often a small part of
the expense of obtaining data from the database, especially for the
typical web-based application. Query parsing, joining, sorting, etc.
are all likely to be the real culprits. The existing caching mechanisms
(as I understand them), and especially the kernel disk caches, do
nothing to eliminate these overhead costs.

    I would venture that the 80/20 rule applies here as in many, many
other instances. A full 80+% of the queries performed against the
database are performed over and over and over again with the same
criteria for a period of time, and then the criteria changes for the
next period of time. This would be particularly true for seldom-
changed tables that, for example, contain a list of the day's
advertisements. The data is changed slowly, once a day or once a
week, but a query is made for every page hit, usually the exact
same query.

    I know, for you purists out there, that this is an obvious call
for an application-level cache. Perhaps so; however, it complicates
the end-programmer environment _and_ it has consistency
disadvantages. Many of the programming languages being used provide
direct interfaces to PostgreSQL (not surprising given that the
programmers are using PostgreSQL) and some may even provide a caching
mechanism. Best case, integrating the two remains a task for the
end-programmer; worst case, the end-programmer has to implement the
cache as well. Rolling a cache into the database removes that
complexity by incorporating it into the existing PostgreSQL API.
(BTW, I'm aware that the consistency disadvantages of the
application-level cache could probably be overcome by implementing
notify in the cache but, again, at added end-programmer expense.)

Getting back to the original posting, as I remember it, the
question was about seldom changed information. In that case, and
assuming a repetitive query as above, a simple query results cache
that is keyed on the passed SQL statement string and that simply
returns the previously cooked result set would be a really big
performance win.

    Registering each cache entry by the tables included in the query,
and invalidating the cache on a committed update or insert
transaction to any of those tables, would transparently solve the
consistency problem.
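To make the idea concrete, here is a minimal sketch of such a results
cache in Python. It is purely illustrative: the class and method names
(`QueryResultCache`, `put`, `get`, `invalidate`) are my own invention,
not any proposed PostgreSQL API, and a real in-server implementation
would of course live in the backend and hook the commit path.

```python
# Illustrative sketch only: entries are keyed on the exact SQL text,
# each entry records the tables it read, and a committed write to any
# of those tables invalidates every entry that read it.
class QueryResultCache:
    def __init__(self):
        self._results = {}   # sql text -> cached ("cooked") result set
        self._readers = {}   # table name -> set of sql texts to drop

    def put(self, sql, tables, result):
        """Cache `result` for `sql`; `tables` are the relations it read."""
        self._results[sql] = result
        for table in tables:
            self._readers.setdefault(table, set()).add(sql)

    def get(self, sql):
        """Return the cached result set, or None on a miss."""
        return self._results.get(sql)

    def invalidate(self, table):
        """Called when an update/insert touching `table` commits."""
        for sql in self._readers.pop(table, set()):
            self._results.pop(sql, None)


cache = QueryResultCache()
sql = "SELECT * FROM ads WHERE day = 'monday'"
cache.put(sql, ["ads"], [("banner1",), ("banner2",)])
print(cache.get(sql))      # repeat hits are served from memory
cache.invalidate("ads")    # a committed write to "ads" drops the entry
print(cache.get(sql))
```

Note that invalidation is keyed by table rather than by query, which is
what makes the consistency argument above work: the seldom-changed
advertisements table pays one invalidation per daily update, while
every intervening page hit is a cache hit.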

    Does this improve the "more interesting" case of heavily updated
tables? Not likely; however, for many web applications it will
likely improve 80% of the queries, leaving more cycles (and bandwidth)
available for the non-cacheable queries.

    There would be other issues as well: for example, non-invalidated
cache entries will accumulate rapidly if the criteria changes often,
large result sets will cause cache contention, cursors will (likely)
be unable to use the cache, and syntax/commands for manipulating
cacheability would be needed. None of this eliminates the basic value
of a results cache under the conditions specified above, conditions
that I would venture to say make up a large part of the queries that
are (or could be) made by a web application.

Thanks,
F

--
F Harvell
407 467-1919
