Quick Links

Re: Some regular-expression performance hacking

From:	"Joel Jacobson" <joel(at)compiler(dot)org>
To:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: Some regular-expression performance hacking
Date:	2021-02-18 19:58:07
Message-ID:	d8899c66-5492-43e9-8eff-a07d1d593ad0@www.fastmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Thu, Feb 18, 2021, at 19:10, Tom Lane wrote:
> "Joel Jacobson" <joel(at)compiler(dot)org> writes:
> >> I've produced a new dataset which now also includes the regex flags (if
> >> any) used for each subject applied to a pattern.
>
> Again, thanks for collecting this data! I'm a little confused about
> how you produced the results in the "tests" table, though. It sort
> of looks like you tried to feed the Javascript flags to regexp_match(),
> which unsurprisingly doesn't work all that well.

That's exactly what I did. Some of the flags work the same between Javascript and PostgreSQL, others don't.

I thought maybe something interesting would surface in just trying them blindly.

Flags that aren't supported and gives errors are reported as tests where error is not null.

Most patterns have no flags, and second most popular is just the "i" flag, which should work the same.

SELECT flags, COUNT(*) FROM patterns GROUP BY 1 ORDER BY 2 DESC;
flags | count
-------+--------
| 151927
i | 120336
gi | 26057
g | 13263
gm | 4606
gim | 699
im | 491
y | 367
m | 365
gy | 105
u | 50
giy | 38
giu | 20
gimu | 14
iy | 11
iu | 6
gimy | 3
gu | 2
gmy | 2
imy | 1
my | 1
(21 rows)

This query shows what Javascript-regex-flags that could be used as-is without errors:

SELECT
patterns.flags,
COUNT(*)
FROM tests
JOIN subjects ON subjects.subject_id = tests.subject_id
JOIN patterns ON patterns.pattern_id = subjects.pattern_id
WHERE tests.error IS NULL
GROUP BY 1
ORDER BY 2;

flags | count
-------+---------
im | 2534
m | 4460
i | 543598
| 2704177
(4 rows)

I considered filtering/converting the flags to PostgreSQL,
maybe that would be an interesting approach to try as well.

>
> Even discounting
> that, I'm not getting quite the same results, and I don't understand
> why not. So how was that made from the raw "patterns" and "subjects"
> tables?

The rows in the tests table were generated by the create_regexp_tests() function [1]

Each subject now has a foreign key to a specific pattern,
where the (pattern, flags) combination are unique in patterns.
The actual unique constraint is on (pattern_hash, flags) to avoid
an index directly on pattern which can be huge as we've seen.

So, for each subject, it is known via the pattern_id
exactly what flags were used when the regex was compiled
(and later executed/applied with the subject).

[1] https://github.com/truthly/regexes-in-the-wild/blob/master/create_regexp_tests.sql

>
> > PostgreSQL 13.2 on x86_64-apple-darwin19.6.0, compiled by Apple clang version 11.0.3 (clang-1103.0.32.62), 64-bit
> > Time: 758514.703 ms (12:38.515)
> > Time: 755883.600 ms (12:35.884)
> > Time: 746522.107 ms (12:26.522)
> >
> > PostgreSQL 14devel on x86_64-apple-darwin20.3.0, compiled by Apple clang version 12.0.0 (clang-1200.0.32.29), 64-bit
> > HEAD (4e703d671)
> > Time: 519620.646 ms (08:39.621)
> > Time: 518998.366 ms (08:38.998)
> > Time: 519696.129 ms (08:39.696)
>
> Hmmm ... we haven't yet committed any performance-relevant changes to the
> regex code, so it can't take any credit for this improvement from 13.2 to
> HEAD. I speculate that this is due to some change in our parallelism
> stuff (since I observe that this query is producing a parallelized hash
> plan). Still, the next drop to circa 2:21 runtime is impressive enough
> by itself.

OK. Another factor might perhaps be the PostgreSQL 10, 11, 12, 13 versions were compiled elsewhere,
I used the OS X binaries from https://postgresapp.com/, whereas version 14 I of course compiled myself.
Maybe I should have compiled 10, 11, 12, 13 myself instead, for a better comparison,
but I mostly just wanted to verify if I could find any differences, the performance comparison was a bonus.

>
> > Heh, what a funny coincidence:
> > The regex I used to shrink the very-long-pattern,
> > actually happens to run a lot faster with the patches.
>
> Yeah, that just happens to be a poster child for the MATCHALL idea:
>
> > EXPLAIN ANALYZE SELECT regexp_matches(repeat('a',100000),'^(.{1,80})(.*?)(.{1,80})$');
>
> Each of the parenthesized subexpressions of the RE is successfully
> recognized as being MATCHALL, with length range 1..80 for two of them and
> 0..infinity for the middle one. That means the engine doesn't have to
> physically scan the text to determine whether a possible division point
> satisfies the sub-regexp; and that means we can find the correct division
> points in O(N) not O(N^2) time.

Very nice.

Like you said earlier, perhaps the regex engine has been optimized enough for this time.
If not, you want to investigate an additional idea,
that I think can be seen as a generalization of the optimization trick for (.*),
if I've understood how it works correctly.

Let's see if I can explain the idea:

One of the problems with representing regexes with large bracket range expressions, like [a-z],
is you get an explosion of edges, if the model can only represent state transitions for single characters.

If we could instead let a single edge (for a state transition) represent a set of characters,
or normally even more efficiently, a set of range of characters, then we could reduce the
number of edges we need to represent the graph.

The naive approach to just use the ranges as-is doesn't work.

Instead, the graph must first be created with single-character edges.

It is then examined what ranges can be constructed in a way that no single range
overlaps any other range, so that every range can be seen as a character in an alphabet.

Perhaps a bit of fiddling with some examples is easiest
to get a grip of the idea.

Here is a live demo of the idea:
https://compiler.org/reason-re-nfa/src/index.html

The graphs are rendered live when typing in the regex,
using a Javascript port of GraphViz.

For example, try entering the regex: t[a-z]*m

This generates this range-optimized graph for the regex:

/--[a-ln-su-z]-----------------\
|/------t--------------------\ |
|| | |
-->(0)--t-->({0,1})----m-------->({0 1 2}) | |
^---[a-ln-su-z]--/ | |
^-------t-------/ | |
^---------------------------/ |
^-----------------------------/
Notice how the [a-z] bracket expression has been split up,
and we now have 3 distinct set of "ranges":
t
m
[a-ln-su-z]

Since no ranges are overlapping, each such range can safely be seen as a letter in an alphabet.

Once we have our final graph, but before we proceed to generate the machine code for it,
we can shrink the graph further by merging ranges together, which eliminate some edges:

/--------------\
| |
--->(0)--t-->(1)<--[a-ln-z]--/
|^-[a-lnz]-\
\----m-->((2))<----\
| |
\---m---/

Notice how [a-ln-su-z]+t becomes [a-ln-z].

Another optimization I've come up with (or probably re-invented because it feels quite obvious),
is to read more than one character, when knowing for sure multiple characters-in-a-row
are expected, by concatenating edges having only one parent and one child.

In our example, we know for sure at least two characters will be read for the regex t[a-z]*m,
so with this optimization enabled, we get this graph:

/--[a-ln-z]
| |
--->(0)---t[a-ln-z]--->(1)<---+--[a-ln-z]
| | /
| \---m--->((2))<------\
\--------------tm------------^ | |
\----m----/

This makes not much difference for a few characters,
but if we have a long pattern with a long sentence
that is repeated, we could e.g. read in 32 bytes
and compare them all in one operation,
if our machine had 256-bits SIMD registers/instructions.

This idea has also been implemented in the online demo.

There is a level which can be adjusted
from 0 to 32 to control how many bytes to merge at most,
located in the "[+]dfa5 = merge_linear(dfa4)" step.

Anyway, I can totally understand if you've had enough of regex optimizations for this time,
but in case not, I wanted to share my work in this field, in case it's interesting to look at now or in the future.

/Joel

In response to

Re: Some regular-expression performance hacking at 2021-02-18 18:10:47 from Tom Lane

Responses

Re: Some regular-expression performance hacking at 2021-02-18 20:44:07 from Joel Jacobson
Re: Some regular-expression performance hacking at 2021-02-18 20:44:47 from Tom Lane

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Jacob Champion	2021-02-18 20:33:18	Re: Support for NSS as a libpq TLS backend
Previous Message	Amichai Amar	2021-02-18 19:13:34	many-to-many problem