Re: Some regular-expression performance hacking

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Some regular-expression performance hacking
Date: 2021-02-18 11:04:55
Message-ID: 12e6381c-adc1-4711-a619-864fb8df969c@www.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Feb 18, 2021, at 11:30, Joel Jacobson wrote:
>SELECT * FROM vdeviations;
>-[ RECORD 1 ]----+-------------------------------------------------------------------------------------------------------
>pattern | \.(ac|com\.ac|edu\.ac|gov\.ac|net\.ac|mi ... 100497 chars ... abs\.org|yolasite\.com|za\.net|za\.org)$

Heh, what a funny coincidence:
The regex I used to shrink the very-long-pattern,
actually happens to run a lot faster with the patches.

I noticed it when trying to read from the vdeviations view in PostgreSQL 13.2.

Here is my little helper-function which I used to shrink patterns/subjects longer than N characters:

CREATE OR REPLACE FUNCTION shrink_text(text,integer) RETURNS text LANGUAGE sql AS $$
SELECT CASE WHEN length($1) < $2 THEN $1 ELSE
format('%s ... %s chars ... %s', m[1], length(m[2]), m[3])
END
FROM (
SELECT regexp_matches($1,format('^(.{1,%1$s})(.*?)(.{1,%1$s})$',$2/2)) AS m
) AS q
$$;

The regex aims to produce three capture groups,
where I wanted the first and third ones to be greedy
and match up to $2 characters (controlled by the second input param to the function),
and the second capture group in the middle to be non-greedy,
but match the remainder to make up a fully anchored match.

It works like expected in both 13.2 and HEAD+patches, but the speed-up it enormous:

PostgreSQL 13.2:
EXPLAIN ANALYZE SELECT regexp_matches(repeat('a',100000),'^(.{1,80})(.*?)(.{1,80})$');
QUERY PLAN
-------------------------------------------------------------------------------------------------
ProjectSet (cost=0.00..0.02 rows=1 width=32) (actual time=23600.816..23600.838 rows=1 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.002 rows=1 loops=1)
Planning Time: 0.432 ms
Execution Time: 23600.859 ms
(4 rows)

HEAD+0001+0002+0003+0004+0005:
EXPLAIN ANALYZE SELECT regexp_matches(repeat('a',100000),'^(.{1,80})(.*?)(.{1,80})$');
QUERY PLAN
-------------------------------------------------------------------------------------------
ProjectSet (cost=0.00..0.02 rows=1 width=32) (actual time=36.656..36.661 rows=1 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.000..0.002 rows=1 loops=1)
Planning Time: 0.575 ms
Execution Time: 36.689 ms
(4 rows)

Cool stuff.

/Joel

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2021-02-18 11:13:01 Re: 64-bit XIDs in deleted nbtree pages
Previous Message Greg Nancarrow 2021-02-18 10:39:08 Re: Parallel INSERT (INTO ... SELECT ...)