Re: Regex performance regression induced by match-all code

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Joel Jacobson" <joel(at)compiler(dot)org>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Regex performance regression induced by match-all code
Date: 2021-05-02 16:53:16
Message-ID: 3709847.1619974396@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

"Joel Jacobson" <joel(at)compiler(dot)org> writes:
> On Sat, May 1, 2021, at 21:46, Tom Lane wrote:
>> I found a nasty performance problem in commit 824bf7190: given the
>> right sort of regex, checkmatchall() takes an unreasonable amount
>> of time.

> Nice catch.
> I've successfully tested the patch on the regex corpus:

Thanks for testing!

>> The main thing I find a bit ugly about this solution is that
>> I'm abusing the state->tmp fields (which are declared struct state *)
>> to hold the memoization results (which are "bool *"). It'd be
>> possible to avoid that by allocating an additional "bool **" array
>> indexed by state number, but whether it's worth that depends on how
>> allergic you are to weird pointer casts.

I tried rewriting it like that, and I have to say I do like it better
that way. I think it's clearer, which seems well worth one extra
malloc.

Also, I poked a little more at the question of the heuristic limit
on number of states, by checking the actual numbers of states in
various ways of writing matchall regexes. It looks to me like
we can cut the limit to DUPINF*2 and still have lots of daylight,
because reasonable (and even not so reasonable) ways to write a
pattern that can match up to K characters seem to come out with
not much more than K states.

Hence, PFA v2. I also added a couple of test cases based on
looking at code coverage in this area, as well as a case that
takes an unreasonably long time without this fix.

regards, tom lane

Attachment Content-Type Size
fix-exponential-cost-of-checkmatchall-2.patch text/x-diff 21.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Korotkov 2021-05-02 17:45:18 Re: websearch_to_tsquery() returns queries that don't match to_tsvector()
Previous Message vignesh C 2021-05-02 16:34:27 Re: Identify missing publications from publisher while create/alter subscription.