Re: Some regular-expression performance hacking

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Some regular-expression performance hacking
Date: 2021-02-15 08:21:21
Message-ID: 99f6cbe0-6e58-4e41-add0-977271a5cf84@www.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Feb 15, 2021, at 04:11, Tom Lane wrote:
>I got these runtimes (non-cassert builds):
>
>HEAD 313661.149 ms (05:13.661)
>+0001 297397.293 ms (04:57.397) 5% better than HEAD
>+0002 151995.803 ms (02:31.996) 51% better than HEAD
>+0003 139843.934 ms (02:19.844) 55% better than HEAD
>+0004 95034.611 ms (01:35.035) 69% better than HEAD
>
>Since I don't have all the tables used in your query, I can't
>try to reproduce your results exactly. I suspect the reason
>I'm getting a better percentage improvement than you did is
>that the joining/grouping/ordering involved in your query
>creates a higher baseline query cost.

Mind blowing speed-up, wow!

I've tested all 4 patches successfully.

To eliminate the baseline cost of the join,
I first created this table:

CREATE TABLE performance_test AS
SELECT
subjects.subject,
patterns.pattern,
tests.is_match,
tests.captured
FROM tests
JOIN subjects ON subjects.subject_id = tests.subject_id
JOIN patterns ON patterns.pattern_id = subjects.pattern_id
JOIN server_versions ON server_versions.server_version_num = tests.server_version_num
WHERE server_versions.server_version = current_setting('server_version')
AND tests.error IS NULL
;

Then I ran this query:

\timing

SELECT
is_match <> (subject ~ pattern),
captured IS DISTINCT FROM regexp_match(subject, pattern),
COUNT(*)
FROM performance_test
GROUP BY 1,2
ORDER BY 1,2
;

All patches gave the same result:

?column? | ?column? | count
----------+----------+---------
f | f | 1448212
(1 row)

I.e., no detected semantic differences.

Timing differences:

HEAD 570632.722 ms (09:30.633)
+0001 472938.857 ms (07:52.939) 17% better than HEAD
+0002 451638.049 ms (07:31.638) 20% better than HEAD
+0003 439377.813 ms (07:19.378) 23% better than HEAD
+0004 96447.038 ms (01:36.447) 83% better than HEAD

I tested on my MacBook Pro 2.4GHz 8-Core Intel Core i9, 32 GB 2400 MHz DDR4 running macOS Big Sur 11.1:

SELECT version();
version
----------------------------------------------------------------------------------------------------------------------
PostgreSQL 14devel on x86_64-apple-darwin20.2.0, compiled by Apple clang version 12.0.0 (clang-1200.0.32.29), 64-bit
(1 row)

My HEAD = 46d6e5f567906389c31c4fb3a2653da1885c18ee.

PostgreSQL was compiled with just ./configure, no parameters, and the only non-default postgresql.conf settings were these:
log_destination = 'csvlog'
logging_collector = on
log_filename = 'postgresql.log'

Amazing work!

I hope to have a new dataset ready soon with regex flags for applied subjects as well.

/Joel

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2021-02-15 08:27:21 Re: ERROR: invalid spinlock number: 0
Previous Message Greg Nancarrow 2021-02-15 07:39:36 Re: Parallel INSERT (INTO ... SELECT ...)