Re: Some regular-expression performance hacking

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Joel Jacobson" <joel(at)compiler(dot)org>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Some regular-expression performance hacking
Date: 2021-02-18 18:53:05
Message-ID: 2845172.1613674385@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I thought it was worth looking a little more closely at the error
cases in this set of tests, as a form of competition analysis versus
Javascript's regex engine. I ran through the cases that gave errors,
and pinned down exactly what was causing the error for as many cases
as I could. (These results are from your first test corpus, but
I doubt the second one would give different conclusions.)

We have these errors reported in the test corpus:

error | count
-----------------------------------+-------
invalid escape \ sequence | 39141
invalid character range | 898
invalid backreference number | 816
brackets [] not balanced | 327
invalid repetition count(s) | 76
quantifier operand invalid | 17
parentheses () not balanced | 1
regular expression is too complex | 1

The existing patchset takes care of the one "regular expression is too
complex" failure. Of the rest:

It turns out that almost 39000 of the "invalid escape \ sequence"
errors are due to use of \D, \S, or \W within a character class.
We support the positive-class shorthands \d, \s, \w there, but not
their negations. I think that this might be something that Henry
Spencer just never got around to; I don't see any fundamental reason
we can't allow it, although some refactoring might be needed in the
regex lexer. Given the apparent popularity of this notation, maybe
we should put some work into that.

(Having said that, I can't help noticing that a very large fraction
of those usages look like, eg, "[\w\W]". It seems to me that that's
a very expensive and unwieldy way to spell ".". Am I missing
something about what that does in Javascript?)

About half of the remaining escape-sequence complaints seem to be due
to just randomly backslashing alphanumeric characters that don't need
it, as for example "i" in "\itunes\.apple\.com". Apparently
Javascript is content to take "\i" as just meaning "i". Our engine
rejects that, with a view to keeping such combinations reserved for
future definition. That's fine by me so I don't want to change it.

Of the rest, many are abbreviated numeric escapes, eg "\u45" where our
engine wants to see "\u0045". I don't think being laxer about that
would be a great idea either.

Lastly, there are some occurrences like "[\1]", which in context look
like the \1 might be intended as a back-reference? But I don't really
understand what that's supposed to do inside a bracket expression.

The "invalid character range" errors seem to be coming from constructs
like "[A-Za-z0-9-/]", which our engine rejects because it looks like
a messed-up character range.

All but 123 of the "invalid backreference number" complaints stem from
using backrefs inside lookahead constraints. Some of the rest look
like they think you can put capturing parens inside a lookahead
constraint and then backref that. I'm not really convinced that such
constructs have a well-defined meaning. (I looked at the ECMAscript
definition of regexes, and they do say it's allowed, but when trying
to define it they resort to handwaving about backtracking; at best that
is a particularly lame version of specification by implementation.)
Spencer chose to forbid these cases in our engine, and I think there
are very good implementation reasons why it won't work. Perhaps we
could provide a clearer error message about it, though.

307 of the "brackets [] not balanced" errors, as well as the one
"parentheses () not balanced" error, seem to trace to the fact that
Javascript considers "[]" to be a legal empty character class, whereas
POSIX doesn't allow empty character classes so our engine takes the
"]" literally, and then looks for a right bracket it won't find.
(That is, in POSIX "[]x]" is a character class matching ']' and 'x'.)
Maybe I'm misinterpreting this too, because if I read the
documentation correctly, "[]" in Javascript matches nothing, making
it impossible for the regex to succeed. Why would such a construct
appear this often?

The remainder of the bracket errors happen because in POSIX, the
sequences "[:", "[=", and "[." within a bracket expression introduce
special syntax, whereas in Javascript '[' is just an ordinary data
character within a bracket expression. Not much we can do here; the
standards are just incompatible.

All but 3 of the "invalid repetition count(s)" errors come from
quantifiers larger than our implementation limit of 255. A lot of
those are exactly 256, though I saw one as high as 3000. The
remaining 3 errors are from syntax like "[0-9]{0-3}", which is a
syntax error according to our engine ("[0-9]{0,3}" is correct).
AFAICT it's not a valid quantifier according to Javascript either;
perhaps that engine is just taking the "{0-3}" as literal text?

Given this, it seems like there's a fairly strong case for increasing
our repetition-count implementation limit, at least to 256, and maybe
1000 or so. I'm hesitant to make the limit *really* large, but if
we can handle a regex containing thousands of "x"'s, it's not clear
why you shouldn't be able to write that as "x{0,1000}".

All of the "quantifier operand invalid" errors come from these
three patterns:
((?!\\)?\{0(?!\\)?\})
((?!\\)?\{1(?!\\)?\})
class="(?!(tco-hidden|tco-display|tco-ellipsis))+.*?"|data-query-source=".*?"|dir=".*?"|rel=".*?"
which are evidently trying to apply a quantifier to a lookahead
constraint, which is just silly.

In short, a lot of this is from incompatible standards, or maybe
from varying ideas about whether to throw an error for invalid
constructs. But I see a couple things we could improve.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amichai Amar 2021-02-18 19:13:34 many-to-many problem
Previous Message Peter Eisentraut 2021-02-18 18:14:30 Re: cursor sensitivity misunderstanding