Re: Pathological regexp match

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Michael Glaesemann <michael(dot)glaesemann(at)myyearbook(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Pathological regexp match
Date: 2010-01-30 07:07:46
Message-ID: 25719.1264835266@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Michael Glaesemann <michael(dot)glaesemann(at)myyearbook(dot)com> writes:
> We came across a regexp that takes very much longer than expected.

I poked into this a little bit. What seems to be happening is that the
use of non-greedy quantifiers plus backreferences turns off most of the
optimization that the regexp engine usually does, leaving the RE as a
tree of possibilities that is explored in a fairly dumb fashion. In
particular, there is a loop in crevdissect() that tries to locate a
feasible division point between two concatenated sub-patterns, and
for each try, a recursive call to crevdissect() tries to locate a
feasible division point between *its* two sub-patterns, resulting in
O(N^2) runtime. With a not very pleasant constant factor, too, because
of repeated startup/shutdown costs for the DFA matching engine.

I found a possible patch that seems to improve matters: AFAICS the DFA
matching is independent of the checking that cdissect() and friends do,
so we can apply that check first instead of second. This brings the
runtime down from minutes to sub-second on my machine. However I don't
have a whole lot of faith either in the correctness of this change, or
that it might not make some other cases slower instead of faster.
Has anyone got a reasonably messy collection of regexps they'd like
to try this patch on?

BTW, I filed the problem upstream with the Tcl project
https://sourceforge.net/tracker/?func=detail&aid=2942697&group_id=10894&atid=110894
but I'm not going to hold my breath waiting for useful advice from
them. Since Henry Spencer dropped off the net, there doesn't seem
to be anybody there who understands that code much better than we do.

Also, we likely still ought to toss some CHECK_FOR_INTERRUPTS calls
in there ...

regards, tom lane

Attachment Content-Type Size
regexp.patch text/x-patch 2.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2010-01-30 09:33:43 Re: PG 9.0 and standard_conforming_strings
Previous Message Mark Mielke 2010-01-30 06:25:59 Re: PG 9.0 and standard_conforming_strings