Re: Performance problems testing with Spamassassin 3.1.0

From: Michael Parker <parkerm(at)pobox(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Performance problems testing with Spamassassin 3.1.0
Date: 2005-08-01 04:04:49
Message-ID: 20050801040443.GA6527@mail.herk.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi All,

As a SpamAssassin developer, who by my own admission has real problem
getting PostgreSQL to work well, I must thank everyone for their
feedback on this issue. Believe me when I say what is in the tree now
is a far cry from what used to be there, orders of magnitude faster
for sure. I think there are several good ideas that have come out of
this thread and I've set about attempting to implement them.

Here is a version of the stored procedure, based in large part by the
one written by Tom Lane, that accepts and array of tokens and loops
over them to either update or insert them into the database (I'm not
including the greatest_int/least_int procedures but you've all seen
them before):

CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER,
intokenary BYTEA[],
inspam_count INTEGER,
inham_count INTEGER,
inatime INTEGER)
RETURNS VOID AS '
DECLARE
_token BYTEA;
new_tokens INTEGER := 0;
BEGIN
for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
LOOP
_token := intokenary[i];
UPDATE bayes_token
SET spam_count = greatest_int(spam_count + inspam_count, 0),
ham_count = greatest_int(ham_count + inham_count, 0),
atime = greatest_int(atime, inatime)
WHERE id = inuserid
AND token = _token;
IF NOT FOUND THEN
-- we do not insert negative counts, just return true
IF NOT (inspam_count < 0 OR inham_count < 0) THEN
INSERT INTO bayes_token (id, token, spam_count,
ham_count, atime)
VALUES (inuserid, _token, inspam_count, inham_count, inatime);
IF FOUND THEN
new_tokens := new_tokens + 1;
END IF;
END IF;
END IF;
END LOOP;

UPDATE bayes_vars
SET token_count = token_count + new_tokens,
newest_token_age = greatest_int(newest_token_age, inatime),
oldest_token_age = least_int(oldest_token_age, inatime)
WHERE id = inuserid;
RETURN;
END;
' LANGUAGE 'plpgsql';

This version is about 32x faster than the old version, with the
default fsync value and autovacuum running in the background.

The next hurdle, and I've just posted to the DBD::Pg list, is
escaping/quoting the token strings. They are true binary strings,
substrings of SHA1 hashes, I don't think the original data set
provided puts them in the right context. They have proved to be
tricky. I'm unable to call the stored procedure from perl because I
keep getting a malformed array litteral error.

Here is some example code that shows the issue:
#!/usr/bin/perl -w

# from a new db, do this first
# INSERT INTO bayes_vars VALUES (1,'nobody',0,0,0,0,0,0,2147483647,0);

use strict;
use DBI;
use DBD::Pg qw(:pg_types);
use Digest::SHA1 qw(sha1);

my $dbh = DBI->connect("DBI:Pg:dbname=spamassassin","postgres") || die;

my @dataary;

# Input is just a list of words (ie /usr/share/dict/words) stop after 150
while(<>) {
chomp;
push(@dataary, substr(sha1($_), -5));
# to see it work with normal string comment out above and uncomment below
# push(@dataary, $_);
last if scalar(@dataary) >= 150;
}

my $datastring = join(",", map { '"' . bytea_esc($_) . '"' }
@dataary);
my $sql = "select put_tokens(1, '{$datastring}', 1, 1, 10000)";
my $sth = $dbh->prepare($sql);
my $rc = $sth->execute();
unless ($rc) {
print "Error: " . $dbh->errstr() . "\n";
}
$sth->finish();

sub bytea_esc {
my ($str) = @_;
my $buf = "";
foreach my $char (split(//,$str)) {
if (ord($char) == 0) { $buf .= "\\\\000"; }
elsif (ord($char) == 39) { $buf .= "\\\\047"; }
elsif (ord($char) == 92) { $buf .= "\\\\134"; }
else { $buf .= $char; }
}
return $buf;
}

Any ideas? or thoughts on the revised procedure? I'd greatly
appriciate them.

Sorry for the length, but hopefully it give a good enough example.

Thanks
Michael Parker

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Tom Lane 2005-08-01 04:42:30 Re: Performance problems testing with Spamassassin 3.1.0
Previous Message William Yu 2005-07-31 21:11:58 Re: Performance problems on 4/8way Opteron (dualcore)