Some regular-expression performance hacking

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Some regular-expression performance hacking
Date: 2021-02-11 04:39:43
Message-ID: 1340281.1613018383@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

As I mentioned in connection with adding the src/test/modules/test_regex
test code, I've been fooling with some performance improvements to our
regular expression engine. Here's the first fruits of that labor.
This is mostly concerned with cutting the overhead for handling trivial
unconstrained patterns like ".*".

0001 creates the concept of a "rainbow" arc within regex NFAs. You can
read background info about this in the "Colors and colormapping" part of
regex/README, but the basic point is that right now, representing a dot
(".", match anything) within an NFA requires a separate arc for each
"color" (character equivalence class) that the regex needs. This uses
up a fair amount of storage and processing effort, especially in larger
regexes which tend to have a lot of colors. We can replace such a
"rainbow" of arcs with a single arc labeled with a special color
RAINBOW. This is worth doing on its own account, just because it saves
space and time. For example, on the reg-33.15.1 test case in
test_regex.sql (a moderately large real-world RE), I find that HEAD
requires 1377614 bytes to represent the compiled RE, and the peak space
usage during pg_regcomp() is 3124376 bytes. With this patch, that drops
to 1077166 bytes for the RE (21% savings) with peak compilation space
2800752 bytes (10% savings). Moreover, the runtime for that test case
drops from ~57ms to ~44ms, a 22% savings. (This is mostly measuring the
RE compilation time. Execution time should drop a bit too since miss()
need consider fewer arcs; but that savings is in a cold code path so it
won't matter much.) These aren't earth-shattering numbers of course,
but for the amount of code needed, it seems well worth while.

A possible point of contention is that I exposed the idea of a rainbow
arc in the regexport.h APIs, which will force consumers of that API
to adapt --- see the changes to contrib/pg_trgm for an example. I'm
not too concerned about this because I kinda suspect that pg_trgm is
the only consumer of that API anywhere. (codesearch.debian.net knows
of no others, anyway.) We could in principle hide the change by
having the regexport functions expand a rainbow arc into one for
each color, but that seems like make-work. pg_trgm would certainly
not see it as an improvement, and in general users of that API should
appreciate recognizing rainbows as such, since they might be able to
apply optimizations that depend on doing so.

Which brings us to 0002, which is exactly such an optimization.
The idea here is to short-circuit character-by-character scanning
when matching a sub-NFA that is like "." or ".*" or variants of
that, ie it will match any sequence of some number of characters.
This requires the ability to recognize that a particular pair of
NFA states are linked by a rainbow, so it's a lot less painful
to do when rainbows are represented explicitly. The example that
got me interested in this is adapted from a Tcl trouble report:

select array_dims(regexp_matches(repeat('x',40) || '=' || repeat('y',50000),
'^(.*)=(.*)$'));

On my machine, this takes about 6 seconds in HEAD, because there's an
O(N^2) effect: we try to match the sub-NFA for the first "(.*)" capture
group to each possible starting string, and only after expensively
verifying that tautological match do we check to see if the next
character is "=". By not having to do any per-character work to decide
that .* matches a substring, the O(N^2) behavior is removed and the time
drops to about 7 msec.

(One could also imagine fixing this by rearranging things to check for
the "=" match before verifying the capture-group matches. That's an
idea I hope to look into in future, because it could help for cases
where the variable parts are not merely ".*". But I don't have clear
ideas about how to do that, and in any case ".*" is common enough that
the present change should still be helpful.)

There are two non-boilerplate parts of the 0002 patch. One is the
checkmatchall() function that determines whether an NFA is match-all,
and if so what the min and max match lengths are. This is actually not
very complicated once you understand what the regex engine does at the
"pre" and "post" states. (See the "Detailed semantics" part of
regex/README for some info about that, which I tried to clarify as part
of the patch.) Other than those endpoint conditions it's just a
recursive graph search. The other hard part is the changes in
rege_dfa.c to provide the actual short-circuit behavior at runtime.
That's ticklish because it's trying to emulate some overly complicated
and underly documented code, particularly in longest() and shortest().
I think that stuff is right; I've studied it and tested it. But it
could use more eyeballs.

Notably, I had to add some more test cases to test_regex.sql to exercise
the short-circuit part of matchuntil() properly. That's only used for
lookbehind constraints, so we won't hit the short-circuit path except
with something like '(?<=..)', which is maybe a tad silly.

I'll add this to the upcoming commitfest.

regards, tom lane

Attachment Content-Type Size
0001-invent-rainbow-arcs.patch text/x-diff 19.6 KB
0002-recognize-matchall-NFAs.patch text/x-diff 17.3 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2021-02-11 04:46:08 Re: Operands don't affect result (CONSTANT_EXPRESSION_RESULT) (src/backend/utils/adt/jsonfuncs.c)
Previous Message Peter Geoghegan 2021-02-11 03:50:40 Re: 64-bit XIDs in deleted nbtree pages