Performance problems testing with Spamassassin 3.1.0 Bayes module.

From: Matthew Schumacher <matt(dot)s(at)aptalaska(dot)net>
To: pgsql-performance(at)postgresql(dot)org
Subject: Performance problems testing with Spamassassin 3.1.0 Bayes module.
Date: 2005-07-27 22:35:20
Message-ID: 42E80C28.2040608@aptalaska.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

I'm not sure how much this has been discussed on the list, but wasn't
able to find anything relevant in the archives.

The new Spamassassin is due out pretty soon. They are currently testing
3.1.0pre4. One of the things I hope to get out of this release is bayes
word stats moved to a real RDBMS. They have separated the mysql
BayesStore module from the PgSQL one so now postgres can use it's own
queries.

I loaded all of this stuff up on a test server and am finding that the
bayes put performance is really not good enough for any real amount of
mail load.

The performance problems seems to be when the bayes module is
inserting/updating. This is now handled by the token_put procedure.

After playing with various indexes and what not I simply am unable to
make this procedure perform any better. Perhaps someone on the list can
spot the bottleneck and reveal why this procedure isn't performing that
well or ways to make it better.

I put the rest of the schema up at
http://www.aptalaska.net/~matt.s/bayes/bayes_pg.sql in case someone
needs to see it too.

CREATE OR REPLACE FUNCTION put_token(integer, bytea, integer, integer,
integer) RETURNS bool AS '
DECLARE
inuserid ALIAS for $1;
intoken ALIAS for $2;
inspam_count ALIAS for $3;
inham_count ALIAS for $4;
inatime ALIAS for $5;
got_token record;
updated_atime_p bool;
BEGIN
updated_atime_p := FALSE;
SELECT INTO got_token spam_count, ham_count, atime
FROM bayes_token
WHERE id = inuserid
AND token = intoken;
IF NOT FOUND THEN
-- we do not insert negative counts, just return true
IF (inspam_count < 0 OR inham_count < 0) THEN
RETURN TRUE;
END IF;
INSERT INTO bayes_token (id, token, spam_count, ham_count, atime)
VALUES (inuserid, intoken, inspam_count, inham_count, inatime);
IF NOT FOUND THEN
RAISE EXCEPTION ''unable to insert into bayes_token'';
return FALSE;
END IF;
UPDATE bayes_vars SET token_count = token_count + 1
WHERE id = inuserid;
IF NOT FOUND THEN
RAISE EXCEPTION ''unable to update token_count in bayes_vars'';
return FALSE;
END IF;
UPDATE bayes_vars SET newest_token_age = inatime
WHERE id = inuserid AND newest_token_age < inatime;
IF NOT FOUND THEN
UPDATE bayes_vars
SET oldest_token_age = inatime
WHERE id = inuserid
AND oldest_token_age > inatime;
END IF;
return TRUE;
ELSE
IF (inspam_count != 0) THEN
-- no need to update atime if it is < the existing value
IF (inatime < got_token.atime) THEN
UPDATE bayes_token
SET spam_count = spam_count + inspam_count
WHERE id = inuserid
AND token = intoken
AND spam_count + inspam_count >= 0;
ELSE
UPDATE bayes_token
SET spam_count = spam_count + inspam_count,
atime = inatime
WHERE id = inuserid
AND token = intoken
AND spam_count + inspam_count >= 0;
IF FOUND THEN
updated_atime_p := TRUE;
END IF;
END IF;
END IF;
IF (inham_count != 0) THEN
-- no need to update atime is < the existing value or if it was
already updated
IF inatime < got_token.atime OR updated_atime_p THEN
UPDATE bayes_token
SET ham_count = ham_count + inham_count
WHERE id = inuserid
AND token = intoken
AND ham_count + inham_count >= 0;
ELSE
UPDATE bayes_token
SET ham_count = ham_count + inham_count,
atime = inatime
WHERE id = inuserid
AND token = intoken
AND ham_count + inham_count >= 0;
IF FOUND THEN
updated_atime_p := TRUE;
END IF;
END IF;
END IF;
IF updated_atime_p THEN
UPDATE bayes_vars
SET oldest_token_age = inatime
WHERE id = inuserid
AND oldest_token_age > inatime;
END IF;
return TRUE;
END IF;
END;
' LANGUAGE 'plpgsql';

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Josh Berkus 2005-07-28 00:12:15 Re: Performance problems testing with Spamassassin 3.1.0 Bayes module.
Previous Message Simon Riggs 2005-07-27 22:00:55 Re: wal_buffer tests in