Re: WIP: lookbehind constraints for our regexp engine

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: WIP: lookbehind constraints for our regexp engine
Date: 2015-10-28 21:34:30
Message-ID: 1662.1446068070@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I wrote:
> I've occasionally heard complaints that our regex engine only has
> lookahead constraints not lookbehind constraints, while Perl's for example
> does have those. While I was fooling around with the other issues in that
> code, I learned enough to realize that it would not be that hard to
> implement such things, and the attached proof-of-concept patch does so
> (using Perl-compatible syntax, in case you're wondering).

> However, I don't feel like this is done to the point of being committable,
> because its performance leaves something to be desired in a lot of cases.
> The difficulty is that because the engine can only match in a forward
> direction, in the worst case you have to check a lookbehind constraint by
> trying to match starting from the string start to see if a match exists
> ending at the current-location where you're testing the constraint. That
> makes it, worst case, O(N^2) to search a string of length N.

Attached is an updated patch that eliminates the O(N^2) performance issue
by creating a third version of the core text-matching functions (previously
just longest() and shortest()). This new version is able to suspend and
resume a scan without significant overhead, and it reports whether a match
exists ending at the current stop point. So we effectively run a
lookbehind constraint's NFA over the string just once, no matter how many
places we actually need to check the constraint's satisfaction at.

This is still not fast in an absolute sense: the limiting scan rate for a
pattern with leading lookbehind constraint (which means the constraint
must be checked at each character) seems to be roughly 10X slower than
for simple constraint-free patterns. For example

regression=# select repeat('xyzzy', 1000000) ~ '(?<=xy)y';
?column?
----------
f
(1 row)

Time: 601.079 ms
regression=# select repeat('xyzzy', 1000000) ~ 'xyy';
?column?
----------
f
(1 row)

Time: 64.213 ms

(As a comparison point, perl 5.10.1 does the same searches in about 250ms
and 30ms respectively; though I'm not comparing apples to oranges exactly
since this is a debug build of PG versus production perl.)

This is not entirely the fault of the lookbehind code, though; most of the
cycles are being spent in miss() in the outer DFA scan, not in checking
the LACON itself. That's a consequence of Henry Spencer's decision that
any state transition involving a LACON check should be treated as a cache
miss. It's not terribly hard to imagine having a code path like "check
this LACON and go to one of two cached states depending on whether it
succeeds". However, I couldn't figure a way to shoehorn that into the
code without adding both cycles to the basic text search loops and bloat
to the stateset cache data structure, costs that would be paid whether or
not you're actually using any LACONs. So maybe Henry made the right
optimization tradeoff.

Also, I put in a compile-time optimization for lookbehind (and lookahead)
constraints that consist of just a single character or bracket expression,
eg these run at about the speed of a simple pattern:

regression=# select repeat('xyzzy', 1000000) ~ '(?<=[ab])y';
?column?
----------
f
(1 row)

Time: 51.936 ms
regression=# select repeat('xyzzy', 1000000) ~ '(?<![xz])y';
?column?
----------
f
(1 row)

Time: 56.149 ms

That's feasible because the constraint can be reduced to a BEHIND (or
AHEAD) constraint, which was a concept that already existed in Henry's
code and doesn't need the LACON machinery at all. (Perl doesn't seem to
have any comparable optimization, btw, so that we are actually beating it
by a wide margin in these cases. Not to mention it fails entirely on
variable-length lookbehind patterns.)

In any case I think this is now a credible candidate to commit as-is.
Making lookbehinds go still faster is likely not the most useful place
to put in effort if someone is looking to micro-optimize regex execution
speed.

Comments?

regards, tom lane

Attachment Content-Type Size
regex-lookbehind-2.0.patch text/x-diff 46.4 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Josh Berkus 2015-10-28 21:39:50 Re: Is there any ordering to the values in guc.c?
Previous Message Bill Moran 2015-10-28 21:33:57 Re: Is there any ordering to the values in guc.c?