Another regexp performance improvement: skip useless paren-captures

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Cc: "Joel Jacobson" <joel(at)compiler(dot)org>
Subject: Another regexp performance improvement: skip useless paren-captures
Date: 2021-08-04 22:15:34
Message-ID: 2219936.1628115334@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Here's a little finger exercise that improves a case that's bothered me
for awhile. In a POSIX regexp, parentheses cause capturing by default;
you have to write the very non-obvious "(?:...)" if you don't want the
matching substring to be reported by the regexp engine. That'd be fine
if capturing were cheap, but with our engine it is not particularly
cheap. In many situations, the initial DFA check is sufficient to
tell whether there is an overall match, but it does not tell where any
subexpression match boundaries are. To identify exactly which substring
is deemed to match a parenthesized subexpression, we have to recursively
break down the match, which takes at the very least a few more DFA
invocations; and with an uncooperative regex, it can easily result in
O(N^2) behavior where there was none at the DFA stage.

Therefore, we really ought to expend some effort to not capture
subexpressions if the sub-match data is not actually needed, which in
many invocations we know that it isn't. Spencer's original code has
a REG_NOSUB option that looks like it ought to be good for this ... but
on closer inspection it's basically useless, because it turns *all*
parens into non-capturing ones. That breaks back-references, so unless
you know that the regexp contains no back-refs, you can't use it.

The attached proposed patch redefines REG_NOSUB as being a regexp-
compile-time promise that the caller doesn't care about sub-match
locations, but not a promise that no backrefs exist. (If the
caller passes a match-locations array at execution anyway, it will
just get back -1 values, as if no sub-match had been identified.)
If that flag is passed, we run through the completed sub-regexp
tree and remove the "capture" markers on any subREs that are
not actually referenced by some backref. This typically causes
some parent subREs to no longer be deemed "messy", so that their
separate child subREs can be thrown away entirely, saving memory
space as well as runtime.

(I'd originally thought that a much more complex patch would be
needed to do this, because I assumed that re-optimizing the subRE
tree would be much more complicated than this. However, as far
as I can see this is sufficient; this change doesn't expose any
cases where additional tree restructuring would be helpful.)

Testing with Joel's handy little corpus of web regexps, there's a
useful improvement of the speed of ~ operators (a/k/a regexp_like()).
I see the total time to apply regexp_like() to all 4474520 entries
dropping from 10:17 to 5:46. Interesting statistics include

regexp=# select max(duration),avg(duration) from headresults;
max | avg
-----------------+-----------------
00:00:00.939389 | 00:00:00.000138
(1 row)

regexp=# select max(duration),avg(duration) from patchresults;
max | avg
-----------------+-----------------
00:00:00.918549 | 00:00:00.000077
(1 row)

The lower percentiles don't move much, but upper ones do:

regexp=# select percentile_cont(array[0.5,0.75,0.8,0.9]) within group(order by duration) from headresults;
percentile_cont
-------------------------------------------------------------------
{00:00:00.000027,00:00:00.000059,00:00:00.000067,00:00:00.000108}
(1 row)

regexp=# select percentile_cont(array[0.5,0.75,0.8,0.9]) within group(order by duration) from patchresults;
percentile_cont
-------------------------------------------------------------------
{00:00:00.000025,00:00:00.000042,00:00:00.000048,00:00:00.000065}
(1 row)

This isn't terribly surprising, because regexps that were already
really cheap probably have no capturing parens to dispense with.

Of course, there's no benefit with functions that do need sub-match
data, such as regexp_match. But the added overhead in such cases
should be quite negligible. The only downside I can see is that
if you use the "same" regexp in both submatches-needed and
non-submatches-needed contexts, you'll end up with two separate
compiled regexp cache entries. That doesn't seem like a big
problem though.

regards, tom lane

Attachment Content-Type Size
optimize-useless-captures-1.patch text/x-diff 13.3 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2021-08-04 22:24:41 Re: Lowering the ever-growing heap->pd_lower
Previous Message Mark Dilger 2021-08-04 22:03:57 Bug fix for cache lookup failure for statistic_ext type