Re: Some regular-expression performance hacking

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Joel Jacobson" <joel(at)compiler(dot)org>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Some regular-expression performance hacking
Date: 2021-02-18 18:10:47
Message-ID: 2840118.1613671847@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

"Joel Jacobson" <joel(at)compiler(dot)org> writes:
>> I've produced a new dataset which now also includes the regex flags (if
>> any) used for each subject applied to a pattern.

Again, thanks for collecting this data! I'm a little confused about
how you produced the results in the "tests" table, though. It sort
of looks like you tried to feed the Javascript flags to regexp_match(),
which unsurprisingly doesn't work all that well. Even discounting
that, I'm not getting quite the same results, and I don't understand
why not. So how was that made from the raw "patterns" and "subjects"
tables?

> PostgreSQL 13.2 on x86_64-apple-darwin19.6.0, compiled by Apple clang version 11.0.3 (clang-1103.0.32.62), 64-bit
> Time: 758514.703 ms (12:38.515)
> Time: 755883.600 ms (12:35.884)
> Time: 746522.107 ms (12:26.522)
>
> PostgreSQL 14devel on x86_64-apple-darwin20.3.0, compiled by Apple clang version 12.0.0 (clang-1200.0.32.29), 64-bit
> HEAD (4e703d671)
> Time: 519620.646 ms (08:39.621)
> Time: 518998.366 ms (08:38.998)
> Time: 519696.129 ms (08:39.696)

Hmmm ... we haven't yet committed any performance-relevant changes to the
regex code, so it can't take any credit for this improvement from 13.2 to
HEAD. I speculate that this is due to some change in our parallelism
stuff (since I observe that this query is producing a parallelized hash
plan). Still, the next drop to circa 2:21 runtime is impressive enough
by itself.

> Heh, what a funny coincidence:
> The regex I used to shrink the very-long-pattern,
> actually happens to run a lot faster with the patches.

Yeah, that just happens to be a poster child for the MATCHALL idea:

> EXPLAIN ANALYZE SELECT regexp_matches(repeat('a',100000),'^(.{1,80})(.*?)(.{1,80})$');

Each of the parenthesized subexpressions of the RE is successfully
recognized as being MATCHALL, with length range 1..80 for two of them and
0..infinity for the middle one. That means the engine doesn't have to
physically scan the text to determine whether a possible division point
satisfies the sub-regexp; and that means we can find the correct division
points in O(N) not O(N^2) time.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2021-02-18 18:14:30 Re: cursor sensitivity misunderstanding
Previous Message Pavel Stehule 2021-02-18 17:25:53 Re: Problem with accessing TOAST data in stored procedures