Re: Performance problems testing with Spamassassin 3.1.0

From: Matthew Schumacher <matt(dot)s(at)aptalaska(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Performance problems testing with Spamassassin 3.1.0
Date: 2005-07-30 21:06:49
Message-ID: 42EBEBE9.4020504@aptalaska.net
Lists: pgsql-performance

Tom Lane wrote:

> I looked into this a bit. It seems that the problem when you wrap the
> entire insertion series into one transaction is associated with the fact
> that the test does so many successive updates of the single row in
> bayes_vars. (VACUUM VERBOSE at the end of the test shows it cleaning up
> 49383 dead versions of the one row.) This is bad enough when it's in
> separate transactions, but when it's in one transaction, none of those
> dead row versions can be marked "fully dead" yet --- so for every update
> of the row, the unique-key check has to visit every dead version to make
> sure it's dead in the context of the current transaction. This makes
> the process O(N^2) in the number of updates per transaction. Which is
> bad enough if you just want to do one transaction per message, but it's
> intolerable if you try to wrap the whole bulk-load scenario into one
> transaction.
>
> I'm not sure that we can do anything to make this a lot smarter, but
> in any case, the real problem is to not do quite so many updates of
> bayes_vars.
>
> How constrained are you as to the format of the SQL generated by
> SpamAssassin? In particular, could you convert the commands generated
> for a single message into a single statement? I experimented with
> passing all the tokens for a given message as a single bytea array,
> as in the attached, and got almost a factor of 4 runtime reduction
> on your test case.
>
> BTW, it's possible that this is all just a startup-transient problem:
> once the database has been reasonably well populated, one would expect
> new tokens to be added infrequently, and so the number of updates to
> bayes_vars ought to drop off.
>
> regards, tom lane
>
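
The quadratic cost described in the quoted message can be illustrated with a toy model. This is only a sketch under a simplifying assumption taken from the description above: inside one transaction, each update of the single row has to visit every dead version left by the earlier updates. The function name is hypothetical and nothing here touches PostgreSQL itself.

```python
def dead_version_visits(n_updates):
    """Count dead row versions visited across n_updates updates of one row.

    Toy model (an assumption, not PostgreSQL internals): update number k
    finds k - 1 dead versions left by earlier updates in the same
    transaction and has to check each of them.
    """
    visits = 0
    dead_versions = 0
    for _ in range(n_updates):
        visits += dead_versions   # check every dead version seen so far
        dead_versions += 1        # this update leaves one more behind
    return visits


# The loop total matches the closed form n * (n - 1) / 2, i.e. O(N^2).
for n in (10, 100, 1000):
    assert dead_version_visits(n) == n * (n - 1) // 2
```

Doubling the number of updates per transaction roughly quadruples the work in this model, which is why fewer updates per transaction helps so much.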

SpamAssassin's Bayes code calls the _put_token method in the storage
module in a loop. This means that the storage module isn't called once
per message, but once per token.

I'll look into modifying it so that the Bayes code passes a hash of
tokens to the storage module, where it can loop over them or, in the
case of the pgsql module, pass an array of tokens to a procedure that
loops and uses temp tables to make this much more efficient.
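
A minimal sketch of that interface change, written in Python rather than SpamAssassin's Perl. Every name here (CountingStore, put_token, put_tokens) is hypothetical and is not the real storage module API. The rule that the summary row is updated whenever a new token appears is an assumption used only to show the effect of batching.

```python
from collections import Counter


class CountingStore:
    """Hypothetical stand-in for a Bayes storage module (not the real API)."""

    def __init__(self):
        self.tokens = Counter()  # token -> count, stands in for the token table
        self.calls = 0           # calls made into the storage module
        self.var_updates = 0     # updates to the single summary row

    def put_token(self, token, delta=1):
        """Per-token interface: one call per token."""
        self.calls += 1
        if token not in self.tokens:
            self.var_updates += 1  # assumed: a new token touches the summary row
        self.tokens[token] += delta

    def put_tokens(self, token_deltas):
        """Batched interface: one call per message, given token -> delta."""
        self.calls += 1
        new_tokens = sum(1 for t in token_deltas if t not in self.tokens)
        for token, delta in token_deltas.items():
            self.tokens[token] += delta
        if new_tokens:
            self.var_updates += 1  # summary row touched at most once per message


message_tokens = ["free", "offer", "click", "free"]

per_token = CountingStore()
for tok in message_tokens:
    per_token.put_token(tok)

batched = CountingStore()
batched.put_tokens(Counter(message_tokens))

# Same stored counts, but far fewer calls and summary-row updates.
assert per_token.tokens == batched.tokens
print(per_token.calls, per_token.var_updates)
print(batched.calls, batched.var_updates)
```

In the pgsql case the body of put_tokens would be a single procedure call taking the token array, so the looping happens inside the database instead of across one call per token.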

I don't have much time this weekend to toss at this, but will be looking
at it on Monday.

Thanks,

schu
