From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | pgsql-hackers(at)postgreSQL(dot)org |
Subject: | WIP: lookbehind constraints for our regexp engine |
Date: | 2015-10-16 23:52:18 |
Message-ID: | 5404.1445039538@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
I've occasionally heard complaints that our regex engine only has
lookahead constraints not lookbehind constraints, while Perl's for example
does have those. While I was fooling around with the other issues in that
code, I learned enough to realize that it would not be that hard to
implement such things, and the attached proof-of-concept patch does so
(using Perl-compatible syntax, in case you're wondering).
However, I don't feel like this is done to the point of being committable,
because its performance leaves something to be desired in a lot of cases.
The difficulty is that because the engine can only match in a forward
direction, in the worst case you have to check a lookbehind constraint by
trying to match starting from the string start to see if a match exists
ending at the current-location where you're testing the constraint. That
makes it, worst case, O(N^2) to search a string of length N. Now, you can
improve on that greatly if you can determine that possible matches don't
extend all the way back to the string start. In the attached patch I use
the "cold start pointer" returned by shortest() to decide that characters
before the coldstart point no longer have to be rechecked. That helps
some, but it's not good enough because there are too many cases where the
engine doesn't move the coldstart point forward very aggressively.
It might be that that behavior itself can be improved on, which would
be nice because there are also non-lookbehind-related scenarios where
smarter coldstart detection would help performance. But I have no very
good ideas about how to do it.
Another idea I've been toying with is to add logic that tries to determine
the maximum possible match length for a regex. If we can determine that
for the lookbehind sub-regex, then we'd know we have to back up at most
that far before applying the match check. This seems promising because
a lot of other engines don't even support variable-length lookbehind
patterns at all, and so we'd be assured of good performance in all the
cases that competitors can handle at all.
Anyway, I'm not planning to do much more work on this right now, but
I thought I'd throw it out there just to let people know that this seems
within reach. I'm curious to know how many people care, and how much.
regards, tom lane
Attachment | Content-Type | Size |
---|---|---|
regex-lookbehind-1.0.patch | text/x-diff | 36.3 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Noah Misch | 2015-10-17 00:45:44 | Re: Parallel Seq Scan |
Previous Message | Michael Paquier | 2015-10-16 23:48:39 | Re: pageinspect patch, for showing tuple data |