Need another way to do this, my sql much to slow...

From: "Jerry Wintrode" <wintrojr(at)tripos(dot)com>
To: <pgsql-sql(at)postgresql(dot)org>
Subject: Need another way to do this, my sql much to slow...
Date: 2003-11-20 14:26:18
Message-ID: 4E676B0AAF74B443A18D7BC7AAB3CFFD1D44D6@s01-exch01.tripos.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

I am attempting to write a spam management add-on for my corporate
server, I have the database and the apps written but one part is WAY to
slow to be usefull.

The following view:

Column | Type | Modifiers
-----------------------+---------+-----------
env_sender_num | integer |
msg_from_claims_count | bigint |

is built on the follow table:

Column | Type | Modifiers
-----------------+-----------------------------+-----------
msg_sender_num | integer |
env_sender_num | integer |
msg_from_domain | character varying(255) |
env_from_domain | character varying(255) |
recorded_date | timestamp without time zone |
Indexes: record_of_claims_env_sender_num_idx btree (env_sender_num),
record_of_claims_msg_sender_num_idx btree (msg_sender_num)

With the following SQL:

SELECT record_of_claims.env_sender_num, count(DISTINCT
record_of_claims.msg_sender_num)
AS msg_from_claims_count
FROM record_of_claims
GROUP BY record_of_claims.env_sender_num;

A sample dataset follows:

msg_sender_num | env_sender_num | msg_from_domain |
env_from_domain | recorded_date
----------------+----------------+-----------------------+--------------
---------+---------------------
1 | 1 | yahoo.com | yahoo.com
| 2003-11-18 13:21:07
2 | 2 | mx128.optinvc13y.com |
mx128.optinvc13y.com | 2003-11-18 13:21:16
3 | 3 | yahoo.com | yahoo.com
| 2003-11-18 13:21:17
4 | 4 | yahoo.com | yahoo.com
| 2003-11-18 13:21:21
5 | 5 | biomarketgroup.com |
biomarketgroup.com | 2003-11-18 13:21:24
6 | 6 | sohu.com | sohu.com
| 2003-11-18 13:21:28
7 | 7 | lycos.com | lycos.com
| 2003-11-18 13:21:38
8 | 8 | mail.expressrx.info |
mail.expressrx.info | 2003-11-18 13:21:41
9 | 9 | approveddeals.com |
approveddeals.com | 2003-11-18 13:21:41
10 | 10 | conceptholidays.co.uk |
conceptholidays.co.uk | 2003-11-18 13:21:48

The msg_sender_num and env_sender_num come from another table of unique
names of senders. What I am attempting to do is see how many times
msg_sender_num 1 claims to be a different env_sender_num. So I have to
find all the entries in msg_sender_num equal to 1 and build a count of
the distinct numbers in env_sender_num. This number is then used later
to say that if a msg_sender_num claims to be more then 2
env_sender_num's then the sender is a spammer and gets added to a list.

Everything is working fine except the SQL above. It takes WAY to long to
process on a 500000+ record database. Hell it takes 12 seconds or so on
a 50000 record database. I have included the query plan to show that the
indexes are being used.

Query Plan:

Aggregate (cost=0.00..166.16 rows=264 width=8) (actual
time=0.98..7768.19 rows=62911 loops=1)
-> Group (cost=0.00..159.57 rows=2635 width=8) (actual
time=0.56..3179.14 rows=80466 loops=1)
-> Index Scan using record_of_claims_env_sender_num_idx on
record_of_claims (cost=0.00..152.99 rows=2635 width=8) (actual
time=0.55..2240.15 rows=80466 loops=1)

Total runtime: 7931.63 msec

Is there a better, read "Faster", way to achieve this?

Jerry Wintrode
Very Newbie Postgres User

Browse pgsql-sql by date

  From Date Subject
Next Message Christoph Haller 2003-11-20 15:38:14 Re: Need another way to do this, my sql much to slow...
Previous Message Erik Thiele 2003-11-20 06:31:31 current_date timezone documentation suggestion