A bunch of regular-expression improvements

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: A bunch of regular-expression improvements
Date: 2015-10-14 23:23:49
Message-ID: 19394.1444865029@sss.pgh.pa.us
Lists: pgsql-hackers

Some of you probably wondered where the heck the recent flurry of activity
around regular expressions (eg, commits 9fe8fe9c9e, b63fc2877) came from.
The answer is that it was mostly driven by some fuzz testing that Greg
Stark reported to the PG security list: he found various random regexp
patterns that crashed the regex compiler and/or caused it to take
unreasonable amounts of time and memory. Some of this was already known;
we had such a report back at
http://www.postgresql.org/message-id/B6F6FD62F2624C4C9916AC0175D56D880CE20F8A@jenmbs01.ad.intershop.net
and there are various similar complaints in the Tcl bug tracker.

The conclusion of the security list was that we didn't want to treat
such problems as CVE-worthy security issues, because given the history
of that code, no security-conscious DBA would allow use of regexes coming
from untrustworthy sources. (I'm not 100% on board with that reasoning
personally, though I have to agree that the warning we added about it was
prudent.) But we did do our best to fix cases where regex compilation
might be uncancelable for long intervals or might recurse to the point
of stack overflow, which were both things Greg showed to be possible.
That left us with some pretty severe performance issues, but hopefully
nothing that use of a statement timeout can't deal with.

Anyway, I got interested in whether the performance issues could be dealt
with properly, and have spent probably more time than I should've with my
nose buried in the regex engine. I've found a number of things that are
indeed fixable, and the attached set of patches is the result.

The primary thing that we found out is that the patch originally applied
to fix CVE-2007-4772 (infinite loop in pullback) is broken: not only does
it not fix all related cases, but it is capable of causing an Assert
failure. Fortunately the assertion condition is harmless in production
builds, so we didn't need to treat that as a security issue either. But
it's still possible to make pullback() go into an infinite loop that will
be broken only by running into the NFA state count limit.

Some background: what the pullback() and pushfwd() routines are trying to
do is get rid of ^ and $ constraint arcs by pushing them to the start and
stop states respectively, or dropping them as unsatisfiable if they
can't be so pushed (for example, in "a^x" there is no possible way to
satisfy the ^ constraint). Likewise AHEAD and BEHIND constraints can be
pushed to the start and stop states or discarded as either certainly
satisfied or certainly not satisfied. However, this works only if there
are no loops of constraint arcs, otherwise the transformation just keeps
pushing the same constraint around the loop, generating more and more
states. The pattern that prompted CVE-2007-4772 was "($|^)*", which
gives rise to an NFA state with ^ and $ constraint arcs that lead to
itself. So the hack that was instituted to deal with that was just to
drop such arcs as useless, which they are. (Whether or not you follow
the arc, you end up in the same state you started in, and you don't
consume any input; so it's useless to follow the arc.) However, that only
fixes trivial single-state constraint loops. It is not difficult to
construct regexes with constraint loops connecting more than one state.
Here are some examples:
(^$)*
(^(?!aa)(?!bb)(?!cc))+
The latter can be extended to construct a loop of as many states as you
please.
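To make the loop structure concrete, here's a toy sketch (Python with invented names; the real engine is C) of finding constraint cycles by depth-first search. The single-state case below is the shape the old CVE-2007-4772 hack handled; the two-state case is the kind it missed:

```python
# Toy model: states are ints, and each constraint arc (^, $, lookahead)
# is an edge that consumes no input.  This only illustrates why patterns
# like "(^$)*" create constraint cycles spanning more than one state.

def find_constraint_cycle(constraint_arcs, start):
    """Return a list of states forming a cycle of constraint arcs, or None."""
    path, on_path = [], set()

    def dfs(s):
        if s in on_path:                     # found a cycle: slice it out
            return path[path.index(s):]
        path.append(s)
        on_path.add(s)
        for t in constraint_arcs.get(s, ()):
            cyc = dfs(t)
            if cyc is not None:
                return cyc
        path.pop()
        on_path.remove(s)
        return None

    return dfs(start)

# "($|^)*": one state whose ^ and $ arcs lead back to itself.
single = {1: [1]}
# "(^$)*"-style: a two-state constraint loop, invisible to a
# single-state-only check.
multi = {1: [2], 2: [1]}
```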

Also, I mentioned that the previous patch could cause an assertion
failure; the pattern "(^)+^" causes that. This happens because the
way the patch was implemented was to remove single-state loop arcs
within pull() or push(), and that subtly breaks the API contract
of those functions. If the removed loop arc was the final outarc
of its source state, and the only other outarc was another constraint
that had been passed to pull(), then pull() would remove both arcs
and then conclude it could delete the source state as well. We'd
then come back to pullback() whose nexta variable is pointing at a
now-deleted arc. Fortunately freearc() sets the type of deleted arcs
to zero, so the loop in pullback() would do nothing with the deleted
arc and exit successfully --- but the assert() that's there would
notice that something was wrong.

I concluded that the only way to fix this was to get rid of the hacks in
pull/push and instead implement a general constraint-loop-breaking pass
before we attempt pullback/pushfwd.

In principle it's okay to break a loop of constraint arcs for the same
reason cited above: traversing all the way around the loop, even if it's
possible because all the constraints are satisfied at the current point of
the string, is pointless since you're back in the same NFA state without
having consumed any input characters. Therefore we can break the loop by
removing the constraint arc(s) between any two chosen loop states, as long
as we make sure that it's still possible to follow any legal transition
sequence that would have involved less than one full trip around the loop.
This can be implemented by generating clone states that we chain onto the
predecessor state of the removed arc(s). For example, if we have a loop
involving S1, S2, S3, and we choose to remove the constraint arc from S1
to S2, we make clone states S2C and S3C; S1's former constraint outarc to
S2 now goes to S2C, and S2's constraint outarc to S3 now has a cloned copy
from S2C to S3C, but we don't clone the arc from S3 to S1 so as not to
create a new loop. All of S2's and S3's non-loop-involved outarcs are
cloned to S2C and S3C so that you can still get to any loop exit states
you could've before.
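The S1/S2/S3 transformation can be sketched as follows (a toy graph, not the engine's data structures; "X" and "Y" are hypothetical loop-exit states, and clone names are invented):

```python
# Break a constraint loop by removing one loop arc and chaining clone
# states onto its source, cloning loop arcs and exit arcs but NOT the
# arc that would close the loop again.

def break_constraint_loop(arcs, loop, cut_src, cut_dst):
    """arcs: {state: set(next_states)}.  loop: the loop states.
    Removes the loop arc cut_src->cut_dst, rerouting it to clones."""
    clone = {s: s + "'" for s in loop if s != cut_src}
    for s in loop:
        if s == cut_src:
            continue
        # clone every outarc except the one that would re-close the loop
        arcs[clone[s]] = {clone.get(t, t) for t in arcs[s] if t != cut_src}
    arcs[cut_src] = (arcs[cut_src] - {cut_dst}) | {clone[cut_dst]}
    return arcs

def reachable(arcs, start):
    seen, todo = set(), [start]
    while todo:
        s = todo.pop()
        if s in seen:
            continue
        seen.add(s)
        todo.extend(arcs.get(s, ()))
    return seen

# Loop S1 -> S2 -> S3 -> S1, with exits X (from S2) and Y (from S3);
# we choose to cut the arc S1 -> S2.
arcs = {"S1": {"S2"}, "S2": {"S3", "X"}, "S3": {"S1", "Y"}}
broken = break_constraint_loop(arcs, ["S1", "S2", "S3"], "S1", "S2")
```

After the transformation, every exit state is still reachable from S1 (via the clones), but no trip all the way around the loop is possible.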

Getting this right and reasonably performant took a few tries :-(,
but I'm pretty happy with the final result, which is attached as
re-fixconstraintloops.patch.

Once we had the infinite-loop problem taken care of, some of Greg's test
cases that created larger NFAs were still taking unreasonable amounts of
time. Performance tracing showed that the problem was some algorithms
that involved O(N^2) time for states with N in-arcs or out-arcs. One
issue is that the lists chaining in-arcs and out-arcs of each state were
singly linked, meaning that freearc() took O(N) time, and deleting all the
arcs of a state in turn could take O(N^2) time if you deleted them in an
inconvenient order, which sometimes happens. The attached patch fixes
that by making those lists doubly-linked. (We'd already made the "color
chains" doubly-linked years ago; not sure why we thought we could get away
with keeping single linking for the inchains and outchains.) Another
problem area was duplicate-arc detection. For example, copyins(), which
copies all in-arcs of one state to another one, blindly passed each
proposed arc to newarc(), which then spent O(N) time checking that the
second state didn't already have an identical in-arc (as it very well
might; we can't just skip duplicate detection). So you end up with O(N^2)
work being done within the copyins() call. moveins(), copyouts(), and
moveouts() have the same problem.
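The doubly-linked change can be sketched in miniature like so (field names invented, not those of the real arc structs): with a back pointer, unlinking an arc is O(1) regardless of deletion order, instead of an O(N) scan for the predecessor.

```python
# Minimal doubly-linked arc chain with O(1) removal, the property the
# patch gives to each state's inchain and outchain.

class Arc:
    def __init__(self, label):
        self.label = label
        self.prev = self.next = None

class ArcList:
    def __init__(self):
        self.head = None

    def push(self, arc):
        arc.next, arc.prev = self.head, None
        if self.head:
            self.head.prev = arc
        self.head = arc
        return arc

    def unlink(self, arc):          # O(1): no scan for the predecessor
        if arc.prev:
            arc.prev.next = arc.next
        else:
            self.head = arc.next
        if arc.next:
            arc.next.prev = arc.prev

    def labels(self):
        out, a = [], self.head
        while a:
            out.append(a.label)
            a = a.next
        return out
```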

The attached re-oNsquared.patch deals with these problems by teaching
those four functions to use a sort-and-merge approach when there are
more than a few source arcs, thus taking O(N log N) time not O(N^2).
This method destroys the original ordering of the arcs, but AFAICS nothing
particularly cares about that.
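The sort-and-merge idea looks roughly like this (arcs modeled as plain tuples for illustration; the real code of course works on the engine's arc structs):

```python
# Instead of checking each copied arc against the destination's list
# one at a time (O(N) per arc, O(N^2) total), sort both arc lists once
# and merge them, emitting each distinct arc exactly once: O(N log N).

def merge_arcs(dest_arcs, src_arcs):
    """Return dest_arcs plus src_arcs, deduplicated."""
    a, b = sorted(dest_arcs), sorted(src_arcs)
    out, i, j = [], 0, 0
    while i < len(a) or j < len(b):
        if j >= len(b) or (i < len(a) and a[i] <= b[j]):
            pick, i = a[i], i + 1
        else:
            pick, j = b[j], j + 1
        if not out or out[-1] != pick:   # skip duplicates
            out.append(pick)
    return out
```

Note the output comes back sorted rather than in the original arc order, which is the "destroys the original ordering" caveat mentioned above.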

That patch also adds a mergeins() function that's unused as of that patch,
but is needed for the next one; it seems to belong here because it is
pretty much like copyins() except that it copies arcs that might have come
from multiple source states.

I also fixed the brain-dead O(N^2) bubble sort algorithm used in
carcsort(). That wasn't much of a bottleneck originally, but it became so
once the other O(N^2) sore spots were gone. (BTW, I wonder whether we
need carcsort() at all --- it doesn't look to me like the regex executor
particularly cares about the ordering of arcs either. But I left it in
for now, so as to avoid introducing any unexpected coupling between the
compiler and executor.)

Once we'd beaten those areas down, further testing showed that
fixempties() still took an unreasonable amount of time on larger NFAs ---
in fact, it seemed to take something like O(N^4) to clean up an N-state
loop of EMPTY arcs. I'd rewritten that once already, years ago, but
it was time to do it again. The attached re-better-fixempties.patch
gets it down to O(N^2), which I think is the best we can do. The win
comes partly from using mergeins() to amortize de-duplication of arcs
across multiple source states, and partly from exploiting knowledge of
the ordering of arcs for each state to avoid looking at arcs we don't
need to consider during the scan.
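For readers unfamiliar with fixempties, here is a naive sketch of what it must accomplish (this toy closure is not the patch's algorithm, whose point is precisely to reach O(N^2) through careful merging): EMPTY arcs consume no input, so each state must end up with direct copies of every non-EMPTY arc reachable through chains of EMPTY arcs, after which the EMPTY arcs can be dropped.

```python
# Naive EMPTY-arc elimination on a toy NFA.  empty_arcs and real_arcs
# map each state to its successor states / its (label, target) arcs.

def fix_empties(empty_arcs, real_arcs):
    """Return the real arcs each state needs once EMPTY arcs are gone."""
    states = set(empty_arcs) | set(real_arcs)
    fixed = {s: set(real_arcs.get(s, ())) for s in states}
    for s in states:
        # states reachable from s via EMPTY arcs only
        seen, todo = set(), [s]
        while todo:
            t = todo.pop()
            if t in seen:
                continue
            seen.add(t)
            todo.extend(empty_arcs.get(t, ()))
        for t in seen:
            fixed[s] |= set(real_arcs.get(t, ()))
    return fixed
```

On an N-state EMPTY loop every state can reach every other, so each one must absorb all N states' real arcs; that is what makes this stage so expensive to do carelessly.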

Also, we found out that the pullback/pushfwd stage was sometimes
creating lots and lots of excess states, because for each constraint arc
that it needed to move, it would generate a new intermediate state.
Constructs such as "\Y" generate lots of parallel constraint arcs between
the same two states, and if we need to push or pull all those arcs, we got
lots more states than we really needed. Patch re-better-pushpull.patch
fixes that by re-using already-created intermediate states. I also took
the trouble to redefine push() and pull() to have a less risky API: they
no longer delete any state or arc that the caller might possibly have
a pointer to, except for the specifically passed constraint arc.

Next up is re-memaccounting.patch, which replaces the previous limit
on the number of NFA states with a combined limit on the number of
states and arcs (measured by memory consumption for simplicity).
The reason we need this is that some contrived regexes involve O(N^2)
arcs despite having only O(N) states, so that it's possible to bloat the
backend to gigabytes without hitting the nfa-has-too-many-states error.
We hadn't really noticed this problem before, because the performance
problems addressed in the previous patches would ensure that you'd get
bored long before running out of memory ... but now it's possible
to get to that in a reasonable amount of time. An example is that with
the previous patches but not this one,
SELECT '' ~ repeat('x*y*z*', 2000);
takes about 10 minutes and 3GB. Without the fixempties rewrite, the
runtime would probably be many days. (BTW, do NOT try that example
without last week's cancel patches ... not only will it take forever,
but it will be uncancelable for most of that.)

With the memaccounting patch, we can fairly confidently say that regexp
compilation cannot take more than about 150MB, or half that on 32-bit
machines, no matter how crazy a pattern you throw at it. (Note: the patch
intentionally omits counting space for the "colormap"; AFAICS that can't
grow to more than a few MB given the existing limit on number of colors,
so it doesn't seem worth the extra complexity to track its space usage
explicitly.)
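The shape of the accounting is simply this (sizes and the limit below are made-up stand-ins, and the class is an illustration, not the patch's C code): every state or arc allocation is charged against one combined budget, so a pattern whose NFA has O(N^2) arcs but only O(N) states still hits the cap.

```python
# Combined states+arcs memory limit, charged at allocation time.

STATE_BYTES, ARC_BYTES = 64, 24        # hypothetical struct sizes

class RegexTooComplex(Exception):
    pass

class AccountedNFA:
    def __init__(self, limit):
        self.used, self.limit = 0, limit
        self.nstates = self.narcs = 0

    def _charge(self, nbytes):
        if self.used + nbytes > self.limit:
            raise RegexTooComplex("regular expression is too complex")
        self.used += nbytes

    def new_state(self):
        self._charge(STATE_BYTES)
        self.nstates += 1

    def new_arc(self):
        self._charge(ARC_BYTES)
        self.narcs += 1
```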

Lastly, re-mopup.patch includes some documentation I wrote along the way,
some more regression test cases, and some minor fixes that didn't seem to
fit into the other categories, notably removing the no-longer-needed "all"
flags for copyins() and copyouts(), and some minor improvements in the
REG_DEBUG data structure printing logic.

I propose to push these into all supported branches, and I'll also be
sending them over to the Tcl crew who will be very happy to be able to
close out some ancient bug tracker entries. If anyone wants to review
'em first, please do.

regards, tom lane

PS: patches are meant to be applied in the given order; they're not
entirely independent.

Attachment Content-Type Size
re-fixconstraintloops.patch text/x-diff 29.0 KB
re-oNsquared.patch text/x-diff 28.7 KB
re-better-fixempties.patch text/x-diff 14.0 KB
re-better-pushpull.patch text/x-diff 12.8 KB
re-memaccounting.patch text/x-diff 7.1 KB
re-mopup.patch text/x-diff 16.9 KB
