Re: Performance problems testing with Spamassassin 3.1.0

From: John Arbash Meinel <john(at)arbash-meinel(dot)com>
To: Matthew Schumacher <matt(dot)s(at)aptalaska(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Performance problems testing with Spamassassin 3.1.0
Date: 2005-07-31 05:31:54
Message-ID: 42EC624A.2020308@arbash-meinel.com
Lists: pgsql-performance

Matthew Schumacher wrote:

>Tom Lane wrote:
>
>>I looked into this a bit. It seems that the problem when you wrap the
>>entire insertion series into one transaction is associated with the fact
>>that the test does so many successive updates of the single row in
>>bayes_vars. (VACUUM VERBOSE at the end of the test shows it cleaning up
>>49383 dead versions of the one row.) This is bad enough when it's in
>>separate transactions, but when it's in one transaction, none of those
>>dead row versions can be marked "fully dead" yet --- so for every update
>>of the row, the unique-key check has to visit every dead version to make
>>sure it's dead in the context of the current transaction. This makes
>>the process O(N^2) in the number of updates per transaction. Which is
>>bad enough if you just want to do one transaction per message, but it's
>>intolerable if you try to wrap the whole bulk-load scenario into one
>>transaction.
>>
>>I'm not sure that we can do anything to make this a lot smarter, but
>>in any case, the real problem is to not do quite so many updates of
>>bayes_vars.
>>
>>How constrained are you as to the format of the SQL generated by
>>SpamAssassin? In particular, could you convert the commands generated
>>for a single message into a single statement? I experimented with
>>passing all the tokens for a given message as a single bytea array,
>>as in the attached, and got almost a factor of 4 runtime reduction
>>on your test case.
>>
>>BTW, it's possible that this is all just a startup-transient problem:
>>once the database has been reasonably well populated, one would expect
>>new tokens to be added infrequently, and so the number of updates to
>>bayes_vars ought to drop off.
>>
>> regards, tom lane
>>
>>
>>
>
>SpamAssassin's bayes code calls the _put_token method in the storage
>module in a loop. This means that the storage module isn't called once per
>message, but once per token.
>
>
Well, wrapping everything for one email in a single transaction might
make your pain go away.
If you saw the email I just sent, I modified your data.sql file to add a
"COMMIT;BEGIN" every 1000 selects, and the run time dropped from 18
minutes to less than 2 minutes. Heck, on my machine, the advanced
Perl version takes more than 2 minutes to run, so it is actually slower
than the data.sql with commit statements.
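
For anyone wanting to repeat the "COMMIT;BEGIN" experiment, here is a minimal sketch of how commit points could be added to a list of SQL statements. It assumes one statement per item (for example, one per line of the file), which may not match how data.sql is actually laid out; the function name add_commit_points and the parameter name every are made up for illustration.

```python
from typing import Iterable, Iterator


def add_commit_points(statements: Iterable[str], every: int = 1000) -> Iterator[str]:
    """Wrap statements in transactions, committing after each `every` statements.

    Hypothetical helper: assumes each item is one complete SQL statement.
    """
    count = 0
    yield "BEGIN;"
    for stmt in statements:
        # Close the current transaction and open a new one at each batch boundary.
        if count and count % every == 0:
            yield "COMMIT;"
            yield "BEGIN;"
        yield stmt
        count += 1
    # Commit whatever is left in the final (possibly partial) batch.
    yield "COMMIT;"


if __name__ == "__main__":
    # Tiny demonstration with a batch size of 2 instead of 1000.
    for line in add_commit_points(["SELECT 1;", "SELECT 2;", "SELECT 3;"], every=2):
        print(line)
```

A smaller batch keeps fewer dead row versions alive per transaction, while a larger one pays for fewer commits; 1000 is simply the value used in the test above.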

>I'll look into modifying it so that the bayes code passes a hash of
>tokens to the storage module where they can loop or, in the case of the
>pgsql module, pass an array of tokens to a procedure where we loop and
>use temp tables to make this much more efficient.
>
>
Well, you could do that. Or you could just have the bayes code issue
"BEGIN;" when it starts processing an email, and a "COMMIT;" when it
finishes. From my testing, you will see an enormous speed improvement.
(And you might consider running VACUUM ANALYZE fairly frequently.)
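
To make the BEGIN/COMMIT-per-email idea concrete, here is a rough sketch of the pattern. It is not SpamAssassin's storage code: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs anywhere, and the table token_counts, its columns, and the function store_message_tokens are all invented for illustration.

```python
import sqlite3


def store_message_tokens(conn: sqlite3.Connection, tokens: list[str]) -> None:
    """Record every token of one message inside a single transaction.

    Hypothetical stand-in for a per-token storage call made in a loop.
    """
    cur = conn.cursor()
    cur.execute("BEGIN")  # issued once, when processing of the email starts
    try:
        for tok in tokens:
            # Bump the counter if the token is known, otherwise insert it.
            cur.execute("UPDATE token_counts SET n = n + 1 WHERE token = ?", (tok,))
            if cur.rowcount == 0:
                cur.execute("INSERT INTO token_counts (token, n) VALUES (?, 1)", (tok,))
        cur.execute("COMMIT")  # issued once, when the email is finished
    except Exception:
        cur.execute("ROLLBACK")
        raise


# isolation_level=None puts sqlite3 in autocommit mode, so the explicit
# BEGIN/COMMIT above are the only transaction boundaries.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE token_counts (token TEXT PRIMARY KEY, n INTEGER NOT NULL)")
store_message_tokens(conn, ["free", "offer", "free"])
```

The point is only that the per-token statements stay as they are; the one change is a single BEGIN before the loop and a single COMMIT after it.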

>I don't have much time this weekend to toss at this, but will be looking
>at it on Monday.
>
>
Good luck,
John
=:->

>Thanks,
>
>schu
