WIP: lookbehind constraints for our regexp engine

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: WIP: lookbehind constraints for our regexp engine
Date: 2015-10-16 23:52:18
Message-ID: 5404.1445039538@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I've occasionally heard complaints that our regex engine only has
lookahead constraints not lookbehind constraints, while Perl's for example
does have those. While I was fooling around with the other issues in that
code, I learned enough to realize that it would not be that hard to
implement such things, and the attached proof-of-concept patch does so
(using Perl-compatible syntax, in case you're wondering).

However, I don't feel like this is done to the point of being committable,
because its performance leaves something to be desired in a lot of cases.
The difficulty is that because the engine can only match in a forward
direction, in the worst case you have to check a lookbehind constraint by
trying to match starting from the string start to see if a match exists
ending at the current-location where you're testing the constraint. That
makes it, worst case, O(N^2) to search a string of length N. Now, you can
improve on that greatly if you can determine that possible matches don't
extend all the way back to the string start. In the attached patch I use
the "cold start pointer" returned by shortest() to decide that characters
before the coldstart point no longer have to be rechecked. That helps
some, but it's not good enough because there are too many cases where the
engine doesn't move the coldstart point forward very aggressively.

It might be that that behavior itself can be improved on, which would
be nice because there are also non-lookbehind-related scenarios where
smarter coldstart detection would help performance. But I have no very
good ideas about how to do it.

Another idea I've been toying with is to add logic that tries to determine
the maximum possible match length for a regex. If we can determine that
for the lookbehind sub-regex, then we'd know we have to back up at most
that far before applying the match check. This seems promising because
a lot of other engines don't even support variable-length lookbehind
patterns at all, and so we'd be assured of good performance in all the
cases that competitors can handle at all.

Anyway, I'm not planning to do much more work on this right now, but
I thought I'd throw it out there just to let people know that this seems
within reach. I'm curious to know how many people care, and how much.

regards, tom lane

Attachment Content-Type Size
regex-lookbehind-1.0.patch text/x-diff 36.3 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Noah Misch 2015-10-17 00:45:44 Re: Parallel Seq Scan
Previous Message Michael Paquier 2015-10-16 23:48:39 Re: pageinspect patch, for showing tuple data