Quick Links

Regex performance regression induced by match-all code

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	pgsql-hackers(at)lists(dot)postgresql(dot)org
Cc:	"Joel Jacobson" <joel(at)compiler(dot)org>
Subject:	Regex performance regression induced by match-all code
Date:	2021-05-01 19:46:02
Message-ID:	3483895.1619898362@sss.pgh.pa.us
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

I found a nasty performance problem in commit 824bf7190: given the
right sort of regex, checkmatchall() takes an unreasonable amount
of time. For example,

regression=# SELECT regexp_matches('', '(.|){20}','');
regexp_matches
----------------
{""}
(1 row)

Time: 129.213 ms
regression=# SELECT regexp_matches('', '(.|){25}','');
regexp_matches
----------------
{""}
(1 row)

Time: 4101.416 ms (00:04.101)
regression=# SELECT regexp_matches('', '(.|){30}','');
regexp_matches
----------------
{""}
(1 row)

Time: 130803.927 ms (02:10.804)

That's quite awful compared to v13, where these cases take
only a couple ms.

Worse still, you can't get out of it with control-C, because
checkmatchall_recurse lacks any check-for-interrupt.

The problem here is basically that we're willing to recursively
examine all paths out of the same NFA state over and over, once for
each possible way of arriving at that state. That's dumb and we can
do better, though it takes a little more code and some more memory.
The attached patch applies a few different methods to make this
better:

* Before starting the recursive search, do a linear-time pass
through all the states to check whether there are any non-RAINBOW
arcs. This allows failing fast for most non-matchall NFAs.

* Memo-ize the results of the recursive search, by storing an
array of possible path lengths for each state after we've examined
it once.

* Rewrite the checks for pseudocolor arcs to make them linear
time rather than O(N^2) --- I decided I'd better do that too,
after noting that the problem cases had fairly large numbers
of pre-state outarcs. This makes them cheap enough to do
before the recursive search not after.

* Put a heuristic upper bound on the number of NFA states for
which we'll attempt this optimization at all. The main reason
for this is to bound the amount of memory we can expend on
memoization results. I think that it will result in little
if any degradation in our ability to recognize matchall NFAs,
because of the existing restriction that we can't represent
cases involving path lengths that are finite but more than DUPINF.
If there are a lot more than DUPINF states then (I think) it becomes
pretty likely that we'd fail due to that restriction anyhow.
However, I've not made much attempt to quantify that argument;
I just chose DUPINF * 4 out of the air.

* Just in case that's not enough to fix things, add a cancel check
within checkmatchall_recurse.

The main thing I find a bit ugly about this solution is that
I'm abusing the state->tmp fields (which are declared struct state *)
to hold the memoization results (which are "bool *"). It'd be
possible to avoid that by allocating an additional "bool **" array
indexed by state number, but whether it's worth that depends on how
allergic you are to weird pointer casts.

Comments?

regards, tom lane

Attachment	Content-Type	Size
fix-exponential-cost-of-checkmatchall-1.patch	text/x-diff	18.6 KB

Responses

Re: Regex performance regression induced by match-all code at 2021-05-02 05:46:39 from Joel Jacobson

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Chapman Flack	2021-05-01 20:06:19	Re: Re: Perform COPY FROM encoding conversions in larger chunks
Previous Message	Bharath Rupireddy	2021-05-01 17:55:44	Re: Enhanced error message to include hint messages for redundant options error