Re: Regexps - never completing join.

From: Rusty Conover <rconover(at)infogears(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Regexps - never completing join.
Date: 2008-05-14 15:33:31
Message-ID: 8103338D-08B9-43A9-A2DC-C7CEDC0994AB@infogears.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Returning to this problem this morning, I made some more insight.

The regexp cache isn't getting very many hits because the executor is
looping through all of the classification rows then looping through
all of the regular expressions, causing each expression to be
recompiled every time since the cache limit is only for 32 cached
regular expressions. You can think of the behavior like:

foreach classification {
foreach regexp {
do match
}
}

Obviously to make this perform better without requiring a bigger
regexp cache I'd like it to run like:

foreach regexp {
foreach classification {
do match
}
}

That way the cache wouldn't have to be very big at all since the last
used regular expression would be at the top of the cache.

Various methods of changing the query don't seem to have the desired
effect. Even with setting join_collapse_limit to 1.

select wc_rule.id from wc_rule cross join classifications on
classifications.classification ~* wc_rule.regexp;

QUERY PLAN
-----------------------------------------------------------------------------
Nested Loop (cost=13.71..891401.71 rows=197843 width=4)
Join Filter: (classifications.classification ~* wc_rule.regexp)
-> Seq Scan on classifications (cost=0.00..1093.46 rows=56446
width=42)
-> Materialize (cost=13.71..20.72 rows=701 width=22)
-> Seq Scan on wc_rule (cost=0.00..13.01 rows=701 width=22)
(5 rows)

select wc_rule.id from classifications cross join wc_rule on
classifications.classification ~* wc_rule.regexp;

QUERY PLAN
-----------------------------------------------------------------------------
Nested Loop (cost=13.71..891401.71 rows=197843 width=4)
Join Filter: (classifications.classification ~* wc_rule.regexp)
-> Seq Scan on classifications (cost=0.00..1093.46 rows=56446
width=42)
-> Materialize (cost=13.71..20.72 rows=701 width=22)
-> Seq Scan on wc_rule (cost=0.00..13.01 rows=701 width=22)
(5 rows)

Both of those queries execute in the same looping order, there doesn't
seem to be a control to say use this table as the inner table and this
table as the outer table for the join that I could find.

One way I did find that worked to control the loop (but doesn't yield
the same results because its a left join)

select wc_rule.id from wc_rule left join classifications on
classifications.classification ~* wc_rule.regexp;

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Nested Loop Left Join (cost=1149.91..891457.45 rows=197843 width=4)
(actual time=0.627..149051.505 rows=55126 loops=1)
Join Filter: (classifications.classification ~* wc_rule.regexp)
-> Seq Scan on wc_rule (cost=0.00..13.01 rows=701 width=22)
(actual time=0.030..1.272 rows=701 loops=1)
-> Materialize (cost=1149.91..1714.37 rows=56446 width=42)
(actual time=0.001..14.244 rows=56446 loops=701)
-> Seq Scan on classifications (cost=0.00..1093.46
rows=56446 width=42) (actual time=0.022..29.913 rows=56446 loops=1)
Total runtime: 149067.764 ms
(6 rows)

Thanks,

Rusty
--
Rusty Conover
InfoGears Inc.
http://www.infogears.com

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Doug Eck 2008-05-14 20:09:48 I/O on select count(*)
Previous Message PFC 2008-05-14 11:10:34 Re: postgres overall performance seems to degrade when large SELECT are requested