Implementing an regex filter

From: Enrico Weigelt <weigelt(at)metux(dot)de>
To: pgsql-sql <pgsql-sql(at)postgresql(dot)org>, postgresql performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Implementing an regex filter
Date: 2007-08-08 12:16:11
Message-ID: 20070808121611.GA27936@nibiru.local
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance pgsql-sql


Hi folks,

I've coded an auction crawler, which queries several auction
platforms for user-defined keywords. The results are then
put into queues an sorted out from there. For example each
auction query can be passed along with an maxprice value,
which is put into the result records. Every few minutes an
filter moves those result records where the article reached
the maxprice away to another queue.

The blacklist filter (which makes me headaches) moves those
records where article's title matches some regex'es to another
queue. One of the users has more than 630 blacklist entries,
linked to one regex about 24kb. I've tried differet approaches,
comparing to each single blacklist entry (join'ing the
blacklist table) or comparing against one huge regex (put the
compiled regex'es per user to an separate table). Both run
very, very long (> 15mins) and make heavy load.

My current scheme (reduced to required stuff):

* article table: article_id(oid)
title(text)
current_price(float)
...

* user table: user_id(oid)
username(text)
...

* user_results table: article_id(oid)
user_id(oid)
username(oid)
queue(oid) <-- only scanning 'incoming'
seen(boolean) <-- seen results are skipped
...

* user_db_list: username(text)
dbname(text)
data(text)
...

* heap: name(text)
data(text)

This is the explain analyze output of the compiled-regex approach:
(the compiled regex is stored in the "heap" table)

auctionwatch=> explain analyze update base.user_results set queue='FOO'
WHERE queue = 'incoming' AND
NOT seen AND
base.user_results.article_id = base.articles.inode_id AND
base.articles.end_time > current_timestamp AND
base.articles.title ~ (
SELECT data FROM base.heap WHERE name =
'blacklist.title::'||base.user_results.username);

Hash Join (cost=2131.38..7622.69 rows=22 width=56) (actual time=1040416.087..1128977.760 rows=1 loops=1)
Hash Cond: ("outer".article_id = "inner".inode_id)
Join Filter: ("inner".title ~ (subplan))
-> Seq Scan on user_results (cost=0.00..593.08 rows=11724 width=56) (actual time=0.104..518.036 rows=11189 loops=1)
Filter: ((queue = 'incoming'::text) AND (NOT seen))
-> Hash (cost=2014.41..2014.41 rows=8787 width=57) (actual time=250.946..250.946 rows=0 loops=1)
-> Seq Scan on articles (cost=0.00..2014.41 rows=8787 width=57) (actual time=0.702..232.754 rows=8663 loops=1)
Filter: (end_time > ('now'::text)::timestamp(6) with time zone)
SubPlan
-> Seq Scan on heap (cost=0.00..1.01 rows=1 width=32) (actual time=0.070..0.072 rows=1 loops=5998)
Filter: (name = ('blacklist.title::'::text || $0))

Total runtime: 1129938.362 ms

And the approach via joining the regex table:

auctionwatch=> explain analyze update base.user_results set queue = 'FOO'
WHERE queue = 'incoming' AND
NOT seen AND
base.user_results.article_id = base.articles.inode_id AND
base.articles.end_time > current_timestamp AND
base.articles.title ~ base.user_db_list.data AND
base.user_db_list.username = base.user_results.username AND
base.user_db_list.dbname = 'BLACKLIST.TITLE' ;

Hash Join (cost=3457.12..11044097.45 rows=3619812 width=56) (actual time=90458.408..126119.167 rows=2 loops=1)
Hash Cond: ("outer".username = "inner".username)
Join Filter: ("inner".title ~ "outer".data)
-> Seq Scan on user_db_list (cost=0.00..5268.16 rows=186333 width=51) (actual time=512.939..514.394 rows=634 loops=1)
Filter: (dbname = 'BLACKLIST.TITLE'::text)
-> Hash (cost=3373.49..3373.49 rows=4254 width=109) (actual time=466.177..466.177 rows=0 loops=1)
-> Hash Join (cost=2221.01..3373.49 rows=4254 width=109) (actual time=225.006..439.334 rows=6023 loops=1)
Hash Cond: ("outer".article_id = "inner".inode_id)
-> Seq Scan on user_results (cost=0.00..593.08 rows=11724 width=56) (actual time=0.155..85.865 rows=11223 loops=1)
Filter: ((queue = 'incoming'::text) AND (NOT seen))
-> Hash (cost=2099.20..2099.20 rows=9127 width=57) (actual time=205.996..205.996 rows=0 loops=1)
-> Seq Scan on articles (cost=0.00..2099.20 rows=9127 width=57) (actual time=0.373..187.468 rows=8662 loops=1)
Filter: (end_time > ('now'::text)::timestamp(6) with time zone)
Total runtime: 126921.854 ms


I'm not sure what "Total runtime" means. Is it the time the analyze
took or the query will take to execute ?

If it's really the execution time, then the second query would be
much faster (about 2mins vs. 18mins). But I really wonder, why
is processing one huge regex so dramatically slow ?

BTW: in some tables I'm using the username instead (or parallel
to) the numerical id to skip joins against the user table. But
I'm not sure if this wise for performance.

Any hints for futher optimization appreciated :)

thx
--
---------------------------------------------------------------------
Enrico Weigelt == metux IT service - http://www.metux.de/
---------------------------------------------------------------------
Please visit the OpenSource QM Taskforce:
http://wiki.metux.de/public/OpenSource_QM_Taskforce
Patches / Fixes for a lot dozens of packages in dozens of versions:
http://patches.metux.de/
---------------------------------------------------------------------

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Enrico Weigelt 2007-08-08 12:40:54 Performance on writable views
Previous Message Heikki Linnakangas 2007-08-08 08:00:48 Re: Update table performance

Browse pgsql-sql by date

  From Date Subject
Next Message Enrico Weigelt 2007-08-08 12:28:33 Re: Converting from MS Access field aliases
Previous Message novice 2007-08-08 05:50:25 Re: populate value of column