Re: Some regular-expression performance hacking

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Joel Jacobson" <joel(at)compiler(dot)org>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Some regular-expression performance hacking
Date: 2021-02-17 21:00:36
Message-ID: 2752386.1613595636@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

"Joel Jacobson" <joel(at)compiler(dot)org> writes:
> I've tested all 4 patches successfully.

Thanks!

I found one other area that could be improved with the same idea of
getting rid of unnecessary subre's: right now, every pair of capturing
parentheses gives rise to a "capture" subre with an NFA identical to
its single child subre (which is what does the actual matching work).
While this doesn't really add any runtime cost, the duplicate NFA
definitely does add to the compilation cost, since we run it through
optimization independently of the child.

I initially thought that we could just flush capture subres altogether
in favor of annotating their children with a "we need to capture this
result" marker. However, Spencer's regression tests soon exposed the
flaw in that notion. It's legal to write "((x))" or even "((((x))))",
so that there can be multiple sets of capturing parentheses with a
single child node. The solution adopted in the attached 0005 is to
handle the innermost capture with a marker on the child subre, but if
we need an additional capture on a node that's already marked, put
a capture subre on top just like before. One could instead complicate
the data structure by allowing N capture markers on a single subre
node, but I judged that not to be a good tradeoff. I don't see any
reason except spec compliance to allow such equivalent capture groups,
so I don't care if they're a bit inefficient. (If anyone knows of a
useful application for writing REs like this, we could reconsider that
choice.)

One small issue with marking the child directly is that we can't get
away any longer with overlaying capture and backref subexpression
numbers, since you could theoretically write (\1) which'd result in
needing to put a capture label on a backref subre. This could again
have been handled by making the capture a separate node, but I really
don't much care for the way that subre.subno has been overloaded for
three(!) different purposes depending on node type. So I just broke
it into three separate fields. Maybe the incremental cost of the
larger subre struct was worth worrying about back in 1997 ... but
I kind of doubt that it was a useful micro-optimization even then,
considering the additional NFA baggage that every subre carries.

Also, I widened "subre.id" from short to int, since the narrower field
no longer saves anything given the new struct layout. The existing
choice was dubious already, because every other use of subre ID
numbers was int or even size_t, and there was nothing checking for
overflow of the id fields. (Although perhaps it doesn't matter,
since I'm unsure that the id fields are really used for anything
except debugging purposes.)

For me, 0005 makes a fairly perceptible difference on your test case
subject_id = 611875, which I've been paying attention to because it's
the one that failed with "regular expression is too complex" before.
I see about a 20% time savings from 0004 on that case, but not really
any noticeable difference in the total runtime for the whole suite.
So I think we're getting to the point of diminishing returns for
this concept (another reason for not chasing after optimization of
the duplicate-captures case). Still, we're clearly way ahead of
where we started.

Attached is an updated patch series; it's rebased over 4e703d671
which took care of some not-really-related fixes, and I made a
pass of cleanup and comment improvements. I think this is pretty
much ready to commit, unless you want to do more testing or
code-reading.

regards, tom lane

Attachment Content-Type Size
0001-invent-rainbow-arcs-4.patch text/x-diff 21.1 KB
0002-recognize-matchall-NFAs-4.patch text/x-diff 16.6 KB
0003-remove-useless-concat-nodes-4.patch text/x-diff 3.6 KB
0004-make-subre-trees-Nary-4.patch text/x-diff 23.7 KB
0005-remove-separate-capture-nodes-4.patch text/x-diff 9.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Gustafsson 2021-02-17 21:19:35 Re: Support for NSS as a libpq TLS backend
Previous Message Robert Haas 2021-02-17 20:56:24 Re: new heapcheck contrib module