Re: Some regular-expression performance hacking

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Chapman Flack" <chap(at)anastigmatix(dot)net>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Some regular-expression performance hacking
Date: 2021-02-26 16:42:32
Message-ID: 0e3addb8-415a-4e94-8294-c7ab86ad46d0@www.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Fri, Feb 26, 2021, at 01:16, Tom Lane wrote:
> 0007-smarter-regex-allocation-2.patch

I've successfully tested this patch.

I had to re-create the performance_test table
since some cases the previously didn't give an error,
now gives error "invalid regular expression: invalid character range".
This is expected and of course an improvement,
but just wanted to explain why the number of rows
don't match the previous test runs.

CREATE TABLE performance_test AS
SELECT
subjects.subject,
patterns.pattern,
patterns.flags,
tests.is_match,
tests.captured
FROM tests
JOIN subjects ON subjects.subject_id = tests.subject_id
JOIN patterns ON patterns.pattern_id = subjects.pattern_id
WHERE tests.error IS NULL
--
-- the below part is added to ignore cases
-- that now results in error:
--
AND NOT EXISTS (
SELECT 1 FROM deviations
WHERE deviations.test_id = tests.test_id
AND deviations.error IS NOT NULL
);
SELECT 3253889

Comparing 13.2 with HEAD,
not a single test resulted in a different is_match value,
i.e. the test just using the ~ regex operator,
to only check if it matches or not. Good.

SELECT COUNT(*)
FROM deviations
JOIN tests ON tests.test_id = deviations.test_id
WHERE tests.is_match <> deviations.is_match

count
-------
0
(1 row)

The below query shows a frequency count per error message:

SELECT error, COUNT(*)
FROM deviations
GROUP BY 1
ORDER BY 2 DESC

error | count
-----------------------------------------------------+--------
| 106173
regexp_match() does not support the "global" option | 5799
invalid regular expression: invalid character range | 1060
invalid regular expression option: "y" | 277
(4 rows)

As we can see, 106173 cases now goes through without an error,
that previously gave an error. This is thanks to now allowing escape
sequences within bracket expressions.

The other errors are expected and all good.

End of correctness analysis. Now let's look at performance!
I reran the same query three times to get a feeling for the stddev.

\timing

SELECT
is_match <> (subject ~ pattern),
captured IS DISTINCT FROM regexp_match(subject, pattern, flags),
COUNT(*)
FROM performance_test
GROUP BY 1,2
ORDER BY 1,2;

?column? | ?column? | count
----------+----------+---------
f | f | 3253889
(1 row)

HEAD (b3a9e9897ec702d56602b26a8cdc0950f23b29dc)
Time: 125938.747 ms (02:05.939)
Time: 125414.792 ms (02:05.415)
Time: 126185.496 ms (02:06.185)

HEAD (b3a9e9897ec702d56602b26a8cdc0950f23b29dc)+0007-smarter-regex-allocation-2.patch

?column? | ?column? | count
----------+----------+---------
f | f | 3253889
(1 row)

Time: 89145.030 ms (01:29.145)
Time: 89083.210 ms (01:29.083)
Time: 89166.442 ms (01:29.166)

That's a 29% speed-up compared to HEAD! Truly amazing.

Let's have a look at the total speed-up compared to PostgreSQL 13.

In my previous benchmarks testing against old versions,
I used precompiled binaries, but this time I compiled REL_13_STABLE:

Time: 483390.132 ms (08:03.390)

That's a 82% speed-up in total! Amazing!

/Joel

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2021-02-26 17:55:16 Re: SSL SNI
Previous Message Tom Lane 2021-02-26 15:11:57 Re: Postgresql network transmission overhead