Skip site navigation (1) Skip section navigation (2)

Re: Pathological regexp match

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Michael Glaesemann <michael(dot)glaesemann(at)myyearbook(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Pathological regexp match
Date: 2010-01-30 07:07:46
Message-ID: 25719.1264835266@sss.pgh.pa.us (view raw or flat)
Thread:
Lists: pgsql-hackers
Michael Glaesemann <michael(dot)glaesemann(at)myyearbook(dot)com> writes:
> We came across a regexp that takes very much longer than expected.

I poked into this a little bit.  What seems to be happening is that the
use of non-greedy quantifiers plus backreferences turns off most of the
optimization that the regexp engine usually does, leaving the RE as a
tree of possibilities that is explored in a fairly dumb fashion.  In
particular, there is a loop in crevdissect() that tries to locate a
feasible division point between two concatenated sub-patterns, and
for each try, a recursive call to crevdissect() tries to locate a
feasible division point between *its* two sub-patterns, resulting in
O(N^2) runtime.  With a not very pleasant constant factor, too, because
of repeated startup/shutdown costs for the DFA matching engine.

I found a possible patch that seems to improve matters: AFAICS the DFA
matching is independent of the checking that cdissect() and friends do,
so we can apply that check first instead of second.  This brings the
runtime down from minutes to sub-second on my machine.  However I don't
have a whole lot of faith either in the correctness of this change, or
that it might not make some other cases slower instead of faster.
Has anyone got a reasonably messy collection of regexps they'd like
to try this patch on?

BTW, I filed the problem upstream with the Tcl project
https://sourceforge.net/tracker/?func=detail&aid=2942697&group_id=10894&atid=110894
but I'm not going to hold my breath waiting for useful advice from
them.  Since Henry Spencer dropped off the net, there doesn't seem
to be anybody there who understands that code much better than we do.

Also, we likely still ought to toss some CHECK_FOR_INTERRUPTS calls
in there ...

			regards, tom lane


Attachment: regexp.patch
Description: text/x-patch (2.2 KB)

In response to

Responses

pgsql-hackers by date

Next:From: Peter EisentrautDate: 2010-01-30 09:33:43
Subject: Re: PG 9.0 and standard_conforming_strings
Previous:From: Mark MielkeDate: 2010-01-30 06:25:59
Subject: Re: PG 9.0 and standard_conforming_strings

Privacy Policy | About PostgreSQL
Copyright © 1996-2013 The PostgreSQL Global Development Group