Silliness in regexp's citerdissect/creviterdissect

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Silliness in regexp's citerdissect/creviterdissect
Date: 2021-08-19 22:31:09
Message-ID: 1808998.1629412269@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

While poking at a report of slow regexp matching [1], I happened
to notice that citerdissect and creviterdissect are being remarkably
stupid about how to backtrack after a match failure. Specifically,
having used the DFA logic to identify K possible submatches, they
then start the slow recursive "cdissect" check of each submatch.
But, if the I'th submatch fails dissection, their response is to
see about adjusting the length of the last submatch ... which will
do nothing at all to make the result of the I'th submatch change,
if it's before K. When this happens, we should immediately
start adjusting the length of the I'th submatch.

Attached is a simple patch to do this. It passes check-world, and
shows the same results as before on Jacobson's web-regexps corpus,
as well as on a sample of random regexps made by Dilger's script.
I don't really see any performance difference on Jacobson's corpus,
but Dilger's script did find an example where this makes a huge
difference:

HEAD:

regexp=# select regexp_match('nefajztngmsvfajztngmsvlwhjsq', '(.*)((\1)){9}');
regexp_match
--------------

(1 row)

Time: 9655.141 ms (00:09.655)

regexp=# select regexp_match('nefajztngmsvfajztngmsvlwhxxxxxxxxxxjsq', '(.*)((\1)){9}');
regexp_match
--------------
{x,x,x}
(1 row)

Time: 271106.324 ms (04:31.106)

With patch:

regexp=# select regexp_match('nefajztngmsvfajztngmsvlwhjsq', '(.*)((\1)){9}');
regexp_match
--------------

(1 row)

Time: 9.385 ms

regexp=# select regexp_match('nefajztngmsvfajztngmsvlwhxxxxxxxxxxjsq', '(.*)((\1)){9}');
regexp_match
--------------
{x,x,x}
(1 row)

Time: 25.103 ms

Admittedly this is a bit contrived. (In v14/HEAD, you can make the
performance problem go away by getting rid of the redundant capturing
parens, as then we don't invoke citerdissect at all. That trick doesn't
help in the back branches though.)

I think this is a pretty clear performance bug, and unless there are
objections I plan to push this fix into all branches.

regards, tom lane

[1] https://www.postgresql.org/message-id/CALZg0g4FA1Xc5UMLrGBKM--erUGEAhe8GGLE-YcN7%3DO6rw2%3D0A%40mail.gmail.com

Attachment Content-Type Size
backtrack-more-effectively-in-citerdissect.patch text/x-diff 2.3 KB

Browse pgsql-hackers by date

  From Date Subject
Next Message David Christensen 2021-08-19 22:43:55 Re: [PATCH] Proof of concept for GUC improvements
Previous Message Vik Fearing 2021-08-19 22:30:31 Re: [PATCH] Proof of concept for GUC improvements