Re: Some regular-expression performance hacking

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Joel Jacobson" <joel(at)compiler(dot)org>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Some regular-expression performance hacking
Date: 2021-02-15 03:11:37
Message-ID: 2155152.1613358697@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

"Joel Jacobson" <joel(at)compiler(dot)org> writes:
> Below is the result of the performance test query:
> -- 8facf1ea00b7a0c08c755a0392212b83e04ae28a:
> Time: 592196.145 ms (09:52.196)
> -- 8facf1ea00b7a0c08c755a0392212b83e04ae28a+patches:
> Time: 461739.364 ms (07:41.739)
> That's an impressive 22% speed-up!

I've been doing some more hacking over the weekend, and have a couple
of additional improvements to show. The point of these two additional
patches is to reduce the number of "struct subre" sub-regexps that
the regex parser creates. The subre's themselves aren't that large,
so this might seem like it would have only small benefit. However,
each subre requires its own NFA for the portion of the RE that it
matches. That adds space, and it also adds compilation time because
we run the "optimize()" pass separately for each such NFA. Maybe
there'd be a way to share some of that work, but I'm not very clear
how. In any case, not having a subre at all is clearly better where
we can manage it.

0003 is a small patch that fixes up parseqatom() so that it doesn't
emit no-op subre's for empty portions of a regexp branch that are
adjacent to a "messy" regexp atom (that is, a capture node, a
backref, or an atom with greediness different from what preceded it).

0004 is a rather larger patch whose result is to get rid of extra
subre's associated with alternation subre's. If we have a|b|c
and any of those alternation branches are messy, we end up with

*
/ \
a *
/ \
b *
/ \
c NULL

where each "*" is an alternation subre node, and all those "*"'s have
identical NFAs that match the whole a|b|c construct. This means that
for an N-way alternation we're going to need something like O(N^2)
work to optimize all those NFAs. That's embarrassing (and I think
it's my fault --- if memory serves, I put in this representation
of messy alternations years ago).

We can improve matters by having just one parent node for an
alternation:

*
\
a -> b -> c

That requires replacing the binary-tree structure of subre's
with a child-and-sibling arrangement, which is not terribly
difficult but accounts for most of the bulk of the patch.
(I'd wanted to do that for years, but up till now I did not
think it would have any real material benefit.)

There might be more that can be done in this line, but that's
as far as I got so far.

I did some testing on this using your dataset (thanks for
giving me a copy) and this query:

SELECT
pattern,
subject,
is_match AS is_match_head,
captured AS captured_head,
subject ~ pattern AS is_match_patch,
regexp_match(subject, pattern) AS captured_patch
FROM subjects
WHERE error IS NULL
AND (is_match <> (subject ~ pattern)
OR captured IS DISTINCT FROM regexp_match(subject, pattern));

I got these runtimes (non-cassert builds):

HEAD 313661.149 ms (05:13.661)
+0001 297397.293 ms (04:57.397) 5% better than HEAD
+0002 151995.803 ms (02:31.996) 51% better than HEAD
+0003 139843.934 ms (02:19.844) 55% better than HEAD
+0004 95034.611 ms (01:35.035) 69% better than HEAD

Since I don't have all the tables used in your query, I can't
try to reproduce your results exactly. I suspect the reason
I'm getting a better percentage improvement than you did is
that the joining/grouping/ordering involved in your query
creates a higher baseline query cost.

Anyway, I'm feeling pretty pleased with these results ...

regards, tom lane

Attachment Content-Type Size
0001-invent-rainbow-arcs-3.patch text/x-diff 19.6 KB
0002-recognize-matchall-NFAs-3.patch text/x-diff 21.7 KB
0003-remove-useless-concat-nodes.patch text/x-diff 3.6 KB
0004-make-subre-trees-Nary.patch text/x-diff 22.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Seamus Abshere 2021-02-15 03:15:04 A reloption for partitioned tables - parallel_workers
Previous Message Masahiro Ikeda 2021-02-15 02:59:48 Re: About to add WAL write/fsync statistics to pg_stat_wal view