Regex back-reference semantics and performance

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Cc: "Joel Jacobson" <joel(at)compiler(dot)org>
Subject: Regex back-reference semantics and performance
Date: 2021-03-01 00:53:49
Message-ID: 661609.1614560029@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I noticed that some of the slowest cases in Joel's regex test corpus
had issues with back-reference matching, and along the way to fixing
that discovered what seems to me to be a bug in the engine's handling
of back-references. To wit, what should happen if a back-reference
is to a subexpression that contains constraints? A simple example is

SELECT regexp_match('foof', '(^f)o*\1');

To my mind, the back reference is only chartered to match the literal
characters matched by the referenced subexpression. Here, since that
expression matches "f", the backref should too, and thus we should get
a match to "foof". Perl gives that answer, anyway; but our existing
code says there's no match. That's because it effectively copies the
constraints within the referenced subexpression, in addition to making
the data comparison. The "^" can't match where the second "f" is, so
we lose.

0001 attached fixes this by stripping constraint arcs out of the NFA
that's applied to the backref subre tree node.

Now, as to the performance issue ... if you load up the data in
"trouble.sql" attached, and do

SELECT regexp_matches(subject, pattern, 'g') FROM trouble;

you'll be waiting a good long time, even with our recent improvements.
(Up to now I hadn't tried the 'g' flag with Joel's test cases, so
I hadn't noticed what a problem this particular example has got.)
The reason for the issue is that the pattern is

(["'`])(?:\\\1|.)*?\1

and the subject string has a mix of " and ' quote characters. As
currently implemented, our engine tries to resolve the match at
any substring ending in either " or ', since the NFA created for
the backref can match either. That leads to O(N^2) time wasted
trying to verify wrong matches.

I realized that this could be improved by replacing the NFA/DFA
match step for a backref node with a string literal match, if the
backreference match string is already known at the time we try
to apply the NFA/DFA. That's not a panacea, but it helps in most
simple cases including this one. The way to visualize what is
happening is that we have a tree of binary concatenation nodes:

concat
/ \
capture concat
/ \
other stuff backref

Each concat node performs fast NFA/DFA checks on both its children
before recursing to the children to make slow exact checks. When we
recurse to the capture node, it records the actual match substring,
so now we know whether the capture is " or '. Then, when we recurse
to the lower concat node, the capture is available while it makes
NFA/DFA checks for its two children; so it will never mistakenly
guess that its second child matches a substring it doesn't, and
thus it won't try to do exact checking of the "other stuff" on a
match that's bound to fail later.

So this works as long as the tree of concat nodes is right-deep,
which fortunately is the normal case. It won't help if we have
a left-deep tree:

concat
/ \
concat backref
/ \
capture other stuff

because the upper concat node will do its NFA/DFA check on the backref
node before recursing to its left child, where the capture will occur.
But to get that tree, you have to have written extra parentheses:

((capture)otherstuff)\2

I don't see a way to improve that situation, unless perhaps with
massive rejiggering of the regex execution engine. But 0002 attached
does help a lot in the simple case.

(BTW, the connection between 0001 and 0002 is that if we want to keep
the existing semantics that a backref enforces constraints, 0002
doesn't work, since it won't do that.)

regards, tom lane

Attachment Content-Type Size
0001-fix-backref-semantics.patch text/x-diff 5.6 KB
0002-backref-performance-hack.patch text/x-diff 6.3 KB
trouble.sql text/plain 8.0 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2021-03-01 01:29:14 Re: Optimising latch signals
Previous Message Justin Pryzby 2021-03-01 00:46:47 Re: doc review for v14