Re: Query meltdown: caching results

From: Gordon <gordon(dot)mcvey(at)ntlworld(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Query meltdown: caching results
Date: 2008-02-27 15:57:54
Message-ID: 7499e6cf-4680-4dfc-809e-10245f4dd734@f47g2000hsd.googlegroups.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Feb 27, 10:37 am, Gordon <gordon(dot)mc(dot)(dot)(dot)(at)ntlworld(dot)com> wrote:
> On Feb 26, 5:26 pm, st(dot)(dot)(dot)(at)enterprisedb(dot)com (Gregory Stark) wrote:
>
>
>
> > "Norman Peelman" <npeel(dot)(dot)(dot)(at)cfl(dot)rr(dot)com> writes:
> > >> My options are, as far as I can tell,
>
> > >> 1) replace the Database PDO extending class with something else that
> > >> provides query results caching in PHP, or
> > >> 2) get Postgres itself to cache the results of queries to avoid
> > >> running them repeatedly during a session.
>
> > You might consider looking at memcached. One way to use it would be to have
> > the PHP application check for the cached object first and use it rather than
> > do any database queries. Then you can use pgmemcached to allow triggers to
> > invalidate cached objects whenever the underlying data changes. (Or you could
> > even just use pl/php to update or invalidate the cached object through the
> > same code library)
>
> > --
> > Gregory Stark
> > EnterpriseDB http://www.enterprisedb.com
> > Ask me about EnterpriseDB's Slony Replication support!
>
> > ---------------------------(end of broadcast)---------------------------
> > TIP 6: explain analyze is your friend
>
> Thanks for the replies, but the problem is really centered around how
> my script interacts with the database. I know what the problem is,
> and if I was using a structured approach I could figure out
> workarounds fairly quickly, but I chose to do this project in OOP for
> a few reasons (the old CMS was hacked together in a hurry by an
> employee who is now long gone, is horribly written and completely
> unmaintainable, the size of the project seemed to warrant an OOP
> approach, we recently upgraded to PHP 5 and I wanted to take advantage
> of the new features, PDO and prepared statements appeared from the
> documentation to offer major security and performance enhancements,
> and I wanted to increase my experience with OOP development as most
> work I've done before now has been structured) and the high level of
> compartmentalization that OOP demands is coming back to haunt me
> now.
>
> The problem comes when a document is published or unpublished. All
> the documents that relate to that document need to be republished to
> add or remove the link to the document in question. When the document
> is published or unpublished the script gets related documents (at the
> moment other documents that share the same folder, but this may change
> in the future to cover more related content) and republishes them.
>
> Each document has associated data, such as the ID of the template to
> apply, its parent, its path back to root, etc, that are generated by
> querying the database in various ways. For example, the route back to
> path is fetched by iterativly getting the parent folder, getting that
> folder's parent, etc until the script hits the route. Templates are
> fetched by looking ath the template ID associated with the document.
> If this is 0, then the script goes to the parent folder and gets the
> template associated with the folder. If that is 0 as well then it
> iterativly goes up until it finds a template to apply or until it hits
> the root, in which case it applies a default template. The code
> fragment from the script that does this looks like this:
>
> $db -> beginTransaction ();
> if ($doc = CmsItem::factory ('CmsDoc', intval ($_GET ['itm_id']),
> $db, $user))
> {
> if ((($doc -> itemProps ['itm_publish']) && ($doc -> unpublish ()))
> || ($doc -> publish ()))
> {
> // Republish related documents
> foreach ($doc -> getSiblingObjects () as $thisDoc)
> {
> if ((get_class ($thisDoc) == 'CmsDoc')
> && ($thisDoc -> itemProps ['itm_publish']))
> {
> $thisDoc -> republish ();
> }
> }
> // Item status changed
> $db -> commit ();
> $_SESSION ['messageStack'][] = ($doc -> itemProps ['itm_publish']?
> 'Item published':
> 'Item unpublished');
> }
> else
> {
> // Couldn't change published status
> $db -> rollback ();
> $_SESSION ['messageStack'][] = ($doc -> itemProps ['itm_publish']?
> 'Unable to unpublish item':
> 'Unable to publish item');
> }
>
> }
>
> GetSiblingObjects () runs a query that gets a list of IDs that share
> the same parent as the current document. It then iterates the list
> and spawns a new CMS item for each item in the list and returns them
> as an array. As folders could be returned as well as documents we
> only run republish () on those items.
>
> CmsDoc -> publish () and CmsDoc -> unpublish () toggle a boolean
> column in the database between true and false for the item being (un)
> published. unpublish () also deletes the concrete file associated
> with the DB entry.
>
> publish () and republish () write out a concrete HTML file based on
> the content stored in the table for the document in question and the
> template that should be applied. The template is determined from a
> template ID column. If it's 0 then the script walks up the tree until
> it finds a template to use as described above.
>
> publish () and republish () rely on a method of CmsDoc called generate
> (), which creates a Smarty instance, does the work described above and
> generates a HTML string for publish () and republish () to write out
> to disk.
>
> public function generate ()
> // Generate document HTML
> {
> $paths = $this -> getTemplatePaths ();
> $page = new Template ($paths ['tplPath'], $paths ['cplPath']);
> $page -> assign_by_ref ('document', $this -> itemProps);
> $page -> assign ('siblings', $this -> getSiblings ());
> $page -> assign ('path', $this -> getTrail ());
> return ($page -> fetch ($paths ['tplFile']));
> }
>
> Template is a class that inherits from Smarty, that just does a little
> setup housekeeping for me and allows me to specify both the template
> and the compile dirs from arguments instead of just the template dir.
> getTemplatePaths returns where templates are stored and where they'll
> be compiled. getSiblings () is like getSiblingObjects () but doesn't
> create objects, just returns a list of siblings. getTrail returns an
> array of folders fro mthe root to the document's containing folder.
>
> The problem is that there is a vast number of redundant queries being
> run. The documents all share the same parent (the list was obtained
> in the first place with getSiblingObjects () but for every one that
> gets republished the getSiblings () and getTrail () queries get run
> (As every document in a folder can have a different template the
> getTemplatePaths () query would have to be run for every one anyway).
> All the solutions I can think of for this problem would involve
> breaking the principle that objects should not depend on a knowledge
> of the inner workings of unrelated objects. the only thing these
> objects are guaranteed to all have in common is that they all share
> the same Database object between them (Database is my PDO extending
> class). This suggests to me that the best solution is to cache the
> results of read queries either in the database engine itself, or in
> the Database object. Results caching would eliminate the problem of
> the same queries beign run over and over because only the first
> invocation of a query would actually query the database. All
> subsequent queries would hit the cache instead.
>
> The problem is the mechanics of actually implementing this caching.
> I'm using prepared statements almost exclusivly throughout the design,
> meaning that the PDOStatement class probably needs to be extended
> somehow and my Database prepare() method should return my extended
> class instead of a generic PDOStatement. The Database class does
> caching of PDOStatements to save the overhead of each instance of a
> class attempting to prepare the same statement over and over again.
> If I can extend PDOStatement to cache its results the nthe repeated
> query overhead will be more or less eliminated. The problem is what
> happens when a database modifying event occurs? The easiest thing to
> do would be to invalidate the cache in all existing PDOStatements,
> which while it's not the most efficiant solution would be the safest.
> but when one PDOStatement executes a modifying query then how will all
> the other PDOStatements know?
>
> This is why I was interested in caching at the database level, as it
> would sidestep this problem.
>
> I can't have been the first person to run up against this problem,
> somebody somewhere must have implemented a framework for PDO that can
> handle caching of results from prepared statements. I've just not been
> able to find one. Please, if anyone out there knows something I don't
> then please let me know.
>
> PS: For those of you who replied via private mail instead of in the
> group, thank you for the responses, I appreciate the help and advice.
> but could you please reply in the group? Those messages get sent to
> my home address, and I'm posting this from work, so I won't see them
> until I get home. Besides, it'll be helpful for the next poor sod who
> hits this problem if there's a publicly viewable thread that'll turn
> up in Google :)

After a lot of hairpulling, I finally found a mechanism in PHP for
doing what I wanted. I just had to know 2 things:

1) How to get the PDO engine to use my customized prepared statement
class instead of PDOStatement
2) Extending PDOStatement to transparently add results caching is too
difficult and complex in the timeframe required

Once I knew these things I made a PDOStatement extension class that
instead of trying to transparently add caching to the existing methods
added a couple of new ones instead. Code below:

<?php

class Statement extends PDOStatement
{
private $resultCache = array ();
private $database = NULL;
public $hits = 0;
public $misses = 0;

public function ask (array $params = array ())
// Executes a prepared statement on the database that fetches data
{
$hash = md5 (implode (',', $params));
if (!$this -> resultCache [$hash])
{
$this -> misses++;
// Execute the query
if ($this -> execute ($params))
{
// Cache the results
$this -> resultCache [$hash] = $this -> fetchAll
(PDO::FETCH_ASSOC);
}
}
else
{
$this -> hits++;
}
return ($this -> resultCache [$hash]);
}
public function tell (array $params = array ())
// Execute a prepared statement that causes the database to be
modified
{
// Execute the query
if ($this -> execute ($params))
{
$rowCount = $this -> rowCount ();
if ($rowCount)
{
// Tell the parent Database object to clear statement caches
$this -> database -> clearResults ();
}
return ($rowCount);
}
}
public function clearResults ()
// Clear cache
{
$this -> resultCache = array ();
}
private function __construct ($db)
// Class constructor
{
$this -> database = $db;
//print_r ($this);
}
}

class Database extends PDO
// Adds some extra functionality to the built in PDO class
{
private $statementCache = array ();
private $txCount = 0;
private $txErr = false;

// Prepared statement cacheing
public function prepare ($statement, array $options = array ())
{
$hash = md5 ($statement);
if ((!isset ($this -> statementCache [$hash]))
|| (!is_object ($this -> statementCache [$hash])))
{
//echo ('Preparing statement "'. $statement .'"<br>');
$this -> statementCache [$hash] = parent::prepare ($statement,
$options);
}
else
{
//echo ('Statement "' . $statement . '" already prepared<br>');
}
return ($this -> statementCache [$hash]);
}
public function clearResults ()
// Clear the results cache of all associated prepared statements
{
foreach ($this -> statementCache as $thisStatement)
{
$thisStatement -> clearResults ();
}
}
// Add support for transaction nesting
public function beginTransaction ()
{
if (($this -> txCount == 0) && (!$this -> txErr))
{
$result = parent::beginTransaction ();
}
$this -> txCount ++;
if (DEBUG_TX)
{
echo ('begin: ' . $this -> txCount . ' transaction(s)<br />');
}
return ($result);
}
public function commit ()
{
$this -> txCount --;
if ($this -> txCount <= 0)
{
$this -> txErr? $result = parent::rollback (): $result =
parent::commit ();
$this -> txErr = false;
}
if (DEBUG_TX)
{
echo ('commit: ' . $this -> txCount . ' transaction(s)<br />');
}
return ($result);
}
public function rollback ()
{
$this -> txErr = true;
$this -> txCount --;
if ($this -> txCount <= 0)
{
$result = parent::rollback ();
$this -> txErr = false;
}
if (DEBUG_TX)
{
echo ('rollback: ' . $this -> txCount . ' transaction(s)<br />');
}
return ($result);
}
// Housekeeping
private function removeExpiredLocks ()
{
$query = 'DELETE FROM cms_locks
WHERE lck_timestamp + lck_duration < NOW();';
$preparedQuery = $this -> prepare ($query);
if ($preparedQuery -> execute ())
{
return ($preparedQuery -> rowCount ());
}
}
// Class constructor
public function __construct ()
{
parent::__construct ( 'pgsql:host=' . CFG_DB_HOST
.' port=' . CFG_DB_PORT
.' dbname=' . CFG_DB_DBNAME,
CFG_DB_USERNAME,
CFG_DB_PASSWORD);
/*
The documentation for the following line on php.net is really bad!

What is does is set what class will be used to handle prepared
statements. By default
the PDO -> prepare() command returns a prepared statement as a
PDOStatement class object.
We want to extend PDOStatements to provide some extra functionality,
so when we run the
prepare () method, we want to return something other than a
PDOStatement.

This line tells the database to use our Statement class for prepared
statements instead
of the PDOStatement class. It also passes a reference to the
database object that
spawned it to the constructor as an argument. We'll use this
reference in the Statement
class to send messages back to the Database class
*/
$this -> setAttribute (PDO::ATTR_STATEMENT_CLASS, array
('Statement', array ($this)));
$this -> setAttribute (PDO::ATTR_ERRMODE, PDO::ERRMODE_WARNING);
$this -> query ('SET search_path = ' . CFG_DB_PATH . ';');
$this -> removeExpiredLocks ();
}
}

The system works like this: I use my Database class where I would
have used PDO to create a database connection. When I prepare() a
statement I either get a new prepared statement back, or the
previously created one is returned. The prepared statement contains
an array that caches the results produced with various parameters. My
prepared statement class also adds 2 database querying methods, ask ()
and tell (), that I use where I normally would have used an execute
(). If I want to retrieve information from the database I ask () it
to return the data matching my parameters. Whenever I want to make a
change I tell () it the data I want saved to the database. The ask ()
method checks the statement's local cache of results and if it finds
one matching the passed parameters, it returns it. If not, then it
execute()s the query to retrieve them.

When I tell () the database to make a change the statement that
received the message to do so sends a message to its parent Database
object. The object goes through its list of cached prepared
statements and sends them messages telling them to dump the contents
of their caches.

It's not an ideal solution, if two different queries produce identical
result sets or ones that overlap in some way then they get cached
twice, and the cache invalidation mechanism is extremely primitive (if
any change is made, dump everything). Nonetheless this little caching
system has had quite a dramatic effect on performance. It's still not
as fast as I would like though. :)

I reposted the code here in the hope that somebody finds it useful,
and/or in case somebody has suggestions for improvement. While the
code itself was quite simple in the end, finding the information
needed to make it work wasn't. The ATTR_STATEMENT_CLASS parameter is
not well documented. Hopefully others can learn from my difficulties
here and learn from them.

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Lew 2008-02-27 16:28:32 Re: reindexing
Previous Message paul rivers 2008-02-27 15:47:31 Re: Query meltdown: caching results