Re: Some regular-expression performance hacking

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Joel Jacobson <joel(at)compiler(dot)org>
Cc: Chapman Flack <chap(at)anastigmatix(dot)net>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Some regular-expression performance hacking
Date: 2021-02-24 02:32:50
Message-ID: 4019408.1614133970@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Here's another little piece of regex performance hacking. This is based
on looking at a slow regexp I found in Tcl's bug tracker:

-- Adapted from http://core.tcl.tk/tcl/tktview?name=446565
select regexp_matches(
repeat('<script> 123 </script> <script> 345 </script> <script> 123 </script>',
100000),
'<script(.(?!</script>))*?(doubleclick|flycast|burstnet|spylog)+?.*?</script>');

The core of the problem here is the lookahead constraint (?!</script>),
which gets applied O(N^2) times for an N-character data string. The
present patch doesn't do anything to cut down the big-O problem, but
it does take a swipe at cutting the constant factor, which should
remain useful even if we find a way to avoid the O(N^2) issue.

Poking at this with perf, I was surprised to observe that the dominant
cost is not down inside lacon() as one would expect, but in the loop
in miss() that is deciding where to call lacon(). 80% of the runtime
is going into these three lines:

for (i = 0; i < d->nstates; i++)
if (ISBSET(d->work, i))
for (ca = cnfa->states[i]; ca->co != COLORLESS; ca++)

So there are two problems here. The outer loop is iterating over all
the NFA states, even though only a small fraction of the states are
likely to have LACON out-arcs. (In the case at hand, the main NFA
has 78 states, of which just one has LACON out-arcs.) Then, for
every reachable state, we're scanning all its out-arcs to find the
ones that are LACONs. (Again, just a fraction of the out-arcs are
likely to be LACONs.) So the main thrust of this patch is to rearrange
the "struct cnfa" representation to separate plain arcs from LACON
arcs, allowing this loop to not waste time looking at irrelevant
states or arcs. This also saves some time in miss()'s preceding
main loop, which is only interested in plain arcs. Splitting the
LACON arcs from the plain arcs complicates matters in a couple of
other places, but none of them are in the least performance-critical.

The other thing I noticed while looking at miss() is that it will
call lacon() for each relevant arc, even though it's quite likely
to see multiple arcs labeled with the same constraint number,
for which the answer must be the same. So I added some simple
logic to cache the last answer and re-use it if the next arc of
interest has the same color. (We could imagine working harder
to cache in the presence of multiple interesting LACONs, but I'm
doubtful that it's worth the trouble. The one-entry cache logic
is so simple it can hardly be a net loss, though.)

On my machine, the combination of these two ideas reduces the
runtime of the example above from ~150 seconds to ~53 seconds,
or nearly 3x better. I see something like a 2% improvement on
Joel's test corpus, which might just be noise. So this isn't
any sort of universal panacea, but it sure helps when LACON
evaluation is the bottleneck.

Any objections? or better ideas?

regards, tom lane

Attachment Content-Type Size
0006-reduce-constant-factor-for-lacons.patch text/x-diff 16.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2021-02-24 02:51:36 Re: Fallback table AM for relkinds without storage
Previous Message Michael Paquier 2021-02-24 02:27:40 Re: some pointless HeapTupleHeaderIndicatesMovedPartitions calls