Re: Regexps - never completing join.

From: "Scott Marlowe" <scott(dot)marlowe(at)gmail(dot)com>
To: "Rusty Conover" <rconover(at)infogears(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Regexps - never completing join.
Date: 2008-05-16 21:44:26
Message-ID: dcc563d10805161444u2101ac4bj7aa3a60a1841f065@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Fri, May 16, 2008 at 3:37 PM, Rusty Conover <rconover(at)infogears(dot)com> wrote:
>
> On May 16, 2008, at 2:35 PM, Scott Marlowe wrote:
>
>> On Wed, May 14, 2008 at 9:33 AM, Rusty Conover <rconover(at)infogears(dot)com>
>> wrote:
>>>
>>> Returning to this problem this morning, I made some more insight.
>>>
>>> One way I did find that worked to control the loop (but doesn't yield the
>>> same results because its a left join)
>>>
>>> select wc_rule.id from wc_rule left join classifications on
>>> classifications.classification ~* wc_rule.regexp;
>>
>> If you do that and exclude the extra rows added to the right with somthing
>> like
>>
>> and wc_rule.somefield IS NOT NULL
>>
>> does it run fast and give you the same answers as the regular join?
>>
>> I'm guessing that this could be optimized to use a hash agg method of
>> joining for text, but I'm no expert on the subject.
>
> Hi Scott,
>
> It's not really a hash agg problem really just a looping inside/outside
> table selection problem.
>
> The slowdown is really the compilation of the regexp repeatedly by
> RE_compile_and_cache() because the regexps are being run on the inside of
> the loop rather then the outside. And since the regexp cache is only 32
> items big, the every match is resulting in a recompilation of the regexp
> since I have about 700 regexps.

That's not what I meant. What I meant was it seems like a good
candidate for a hash aggregate solution. I'm pretty sure pgsql can't
use hashagg for something like this right now.

If you hashagged each regexp and each column fed through it, you could
probably get good performance. but that's a backend hacker thing, not
something I'd know how to do.

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Tom Lane 2008-05-17 02:45:02 Re: I/O on select count(*)
Previous Message Rusty Conover 2008-05-16 21:37:29 Re: Regexps - never completing join.