Re: Performance problems testing with Spamassassin 3.1.0

From: Matthew Schumacher <matt(dot)s(at)aptalaska(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Performance problems testing with Spamassassin 3.1.0
Date: 2005-07-30 21:06:49
Message-ID: 42EBEBE9.4020504@aptalaska.net
Lists: pgsql-performance
Tom Lane wrote:

> I looked into this a bit.  It seems that the problem when you wrap the
> entire insertion series into one transaction is associated with the fact
> that the test does so many successive updates of the single row in
> bayes_vars.  (VACUUM VERBOSE at the end of the test shows it cleaning up
> 49383 dead versions of the one row.)  This is bad enough when it's in
> separate transactions, but when it's in one transaction, none of those
> dead row versions can be marked "fully dead" yet --- so for every update
> of the row, the unique-key check has to visit every dead version to make
> sure it's dead in the context of the current transaction.  This makes
> the process O(N^2) in the number of updates per transaction.  Which is
> bad enough if you just want to do one transaction per message, but it's
> intolerable if you try to wrap the whole bulk-load scenario into one
> transaction.
> 
> I'm not sure that we can do anything to make this a lot smarter, but
> in any case, the real problem is to not do quite so many updates of
> bayes_vars.
> 
> How constrained are you as to the format of the SQL generated by
> SpamAssassin?  In particular, could you convert the commands generated
> for a single message into a single statement?  I experimented with
> passing all the tokens for a given message as a single bytea array,
> as in the attached, and got almost a factor of 4 runtime reduction
> on your test case.
> 
> BTW, it's possible that this is all just a startup-transient problem:
> once the database has been reasonably well populated, one would expect
> new tokens to be added infrequently, and so the number of updates to
> bayes_vars ought to drop off.
> 
> 			regards, tom lane
> 

SpamAssassin's Bayes code calls the _put_token method in the storage
module in a loop.  This means that the storage module isn't called once
per message, but once per token.
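
To get a feel for why one call per token hurts, here is a rough Python
sketch of the O(N^2) cost Tom describes above.  It assumes a simplified
model in which update number k of a single row, inside one transaction,
has to visit the k-1 dead versions left by the earlier updates; the
function name is made up for illustration and the real cost inside
PostgreSQL will differ in detail.

```python
def dead_version_visits(num_updates: int) -> int:
    """Count dead row versions visited under the simplified model.

    Assumption: update k (1-based) must check the k-1 dead versions
    left behind by the earlier updates in the same transaction.
    """
    return sum(k - 1 for k in range(1, num_updates + 1))


def dead_version_visits_closed_form(num_updates: int) -> int:
    """Same count via the closed form N * (N - 1) / 2."""
    return num_updates * (num_updates - 1) // 2


if __name__ == "__main__":
    # Doubling the number of updates roughly quadruples the work.
    for n in (1_000, 2_000, 4_000):
        print(n, dead_version_visits(n))
```

Under this model the work grows with the square of the number of updates
per transaction, so cutting the number of updates is what matters, not
just the number of transactions.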

I'll look into modifying it so that the Bayes code passes a hash of
tokens to the storage module, where each module can loop over them
itself or, in the case of the pgsql module, pass an array of tokens to a
procedure that loops and uses temp tables to make this much more
efficient.
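
The shape of that change can be sketched in Python with an in-memory
stand-in for the storage module.  Everything here is hypothetical: the
class names, the put_token/put_tokens methods, and the vars_updates
counter are illustration only, not SpamAssassin's actual API, and the
sketch assumes for simplicity that every storage call touches the
summary row once.

```python
from collections import Counter


class PerTokenStore:
    """Hypothetical store: one call, and one summary-row update, per token."""

    def __init__(self):
        self.tokens = Counter()
        self.vars_updates = 0  # stand-in for updates to the bayes_vars row

    def put_token(self, token, count=1):
        self.tokens[token] += count
        self.vars_updates += 1  # assumed: summary row touched on every call


class BatchedStore:
    """Hypothetical store: one call per message, taking a hash of tokens."""

    def __init__(self):
        self.tokens = Counter()
        self.vars_updates = 0

    def put_tokens(self, token_counts):
        # The store loops over the tokens itself...
        for token, count in token_counts.items():
            self.tokens[token] += count
        # ...and touches the summary row once for the whole message.
        self.vars_updates += 1


# Made-up tokens for one message.
message_tokens = {"invoice": 3, "meeting": 1, "offer": 2}

per_token = PerTokenStore()
for tok, n in message_tokens.items():
    per_token.put_token(tok, n)

batched = BatchedStore()
batched.put_tokens(message_tokens)

print(per_token.vars_updates, batched.vars_updates)
```

Both stores end up with the same token counts; the batched one just
updates the summary row once per message instead of once per token.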

I don't have much time this weekend to toss at this, but will be looking
at it on Monday.

Thanks,

schu
