Re: Some regular-expression performance hacking

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Some regular-expression performance hacking
Date: 2021-02-13 17:19:34
Message-ID: 85140bfb-516c-449b-8a03-7930abb42b88@www.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Tom,

On Thu, Feb 11, 2021, at 05:39, Tom Lane wrote:
>0001-invent-rainbow-arcs.patch
>0002-recognize-matchall-NFAs.patch

Many thanks for working on the regex engine,
this looks like an awesome optimization.

To test the correctness of the patches,
I thought it would be nice with some real-life regexes,
and just as important, some real-life text strings,
to which the real-life regexes are applied to.

I therefore patched Chromium's v8 regexes engine,
to log the actual regexes that get compiled when
visiting websites, and also the text strings that
are the regexes are applied to during run-time
when the regexes are executed.

I logged the regex and text strings as base64 encoded
strings to STDOUT, to make it easy to grep out the data,
so it could be imported into PostgreSQL for analytics.

In total, I scraped the first-page of some ~50k websites,
which produced 45M test rows to import,
which when GROUP BY pattern and flags was reduced
down to 235k different regex patterns,
and 1.5M different text string subjects.

Here are some statistics on the different flags used:

SELECT *, SUM(COUNT) OVER () FROM (SELECT flags, COUNT(*) FROM patterns GROUP BY flags) AS x ORDER BY COUNT DESC;
flags | count | sum
-------+--------+--------
| 150097 | 235204
i | 43537 | 235204
g | 22029 | 235204
gi | 15416 | 235204
gm | 2411 | 235204
gim | 602 | 235204
m | 548 | 235204
im | 230 | 235204
y | 193 | 235204
gy | 60 | 235204
giy | 29 | 235204
giu | 26 | 235204
u | 11 | 235204
iy | 6 | 235204
gu | 5 | 235204
gimu | 2 | 235204
iu | 1 | 235204
my | 1 | 235204
(18 rows)

As we can see, no flag at all is the most common, followed by the "i" flag.

Most of the Javascript-regexes (97%) could be understood by PostgreSQL,
only 3% produced some kind of error, which is not unexpected,
since some Javascript-regex features like \w and \W have different
syntax in PostgreSQL:

SELECT *, SUM(COUNT) OVER () FROM (SELECT is_match,error,COUNT(*) FROM subjects GROUP BY is_match,error) AS x ORDER BY count DESC;
is_match | error | count | sum
----------+---------------------------------------------------------------+--------+---------
f | | 973987 | 1489489
t | | 474225 | 1489489
| invalid regular expression: invalid escape \ sequence | 39141 | 1489489
| invalid regular expression: invalid character range | 898 | 1489489
| invalid regular expression: invalid backreference number | 816 | 1489489
| invalid regular expression: brackets [] not balanced | 327 | 1489489
| invalid regular expression: invalid repetition count(s) | 76 | 1489489
| invalid regular expression: quantifier operand invalid | 17 | 1489489
| invalid regular expression: parentheses () not balanced | 1 | 1489489
| invalid regular expression: regular expression is too complex | 1 | 1489489
(10 rows)

Having had some fun looking at statistics, let's move on to look at if there are any
observable differences between HEAD (8063d0f6f56e53edd991f53aadc8cb7f8d3fdd8f)
and when these two patches have been applied.

To detect any differences,
for each (regex pattern, text string subject) pair,
the columns,
is_match boolean
captured text[]
error text
were set by a PL/pgSQL function running HEAD:

BEGIN
_is_match := _subject ~ _pattern;
_captured := regexp_match(_subject, _pattern);
EXCEPTION WHEN OTHERS THEN
UPDATE subjects SET
error = SQLERRM
WHERE subject_id = _subject_id;
CONTINUE;
END;
UPDATE subjects SET
is_match = _is_match,
captured = _captured
WHERE subject_id = _subject_id;

The patches

0001-invent-rainbow-arcs.patch
0002-recognize-matchall-NFAs.patch

were then applied and this query was executed to spot any differences:

SELECT
is_match <> (subject ~ pattern) AS is_match_diff,
captured IS DISTINCT FROM regexp_match(subject, pattern) AS captured_diff,
COUNT(*)
FROM subjects
WHERE error IS NULL
AND (is_match <> (subject ~ pattern) OR captured IS DISTINCT FROM regexp_match(subject, pattern))
GROUP BY 1,2
ORDER BY 3 DESC
;

The query was first run on the unpatched HEAD to verify it detects no results.
0 rows indeed, and it took this long to finish the query:

Time: 186077.866 ms (03:06.078)

Running the same query with the two patches, was significantly faster:

Time: 111785.735 ms (01:51.786)

No is_match differences were detected, good!

However, there were 23 cases where what got captured differed:

-[ RECORD 1 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (?:^v-([a-z0-9-]+))?(?:(?::|^@|^#)(\[[^\]]+\]|[^\.]+))?(.+)?$
subject | v-cloak
is_match_head | t
captured_head | {cloak,NULL,NULL}
is_match_patch | t
captured_patch | {NULL,NULL,v-cloak}
-[ RECORD 2 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (?:^v-([a-z0-9-]+))?(?:(?::|^@|^#)(\[[^\]]+\]|[^\.]+))?(.+)?$
subject | v-if
is_match_head | t
captured_head | {if,NULL,NULL}
is_match_patch | t
captured_patch | {NULL,NULL,v-if}
-[ RECORD 3 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?a5oc.com).*
subject | https://a5oc.com/attachments/6b184e79-6a7f-43e0-ac59-7ed9d0a8eb7e-jpeg.179582/
is_match_head | t
captured_head | {https://,a5oc.com,NULL <https://%2Ca5oc.com%2Cnull/>}
is_match_patch | t
captured_patch |
-[ RECORD 4 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?allfordmustangs.com).*
subject | https://allfordmustangs.com/attachments/e463e329-0397-4e13-ad41-f30c6bc0659e-jpeg.779299/
is_match_head | t
captured_head | {https://,allfordmustangs.com,NULL <https://%2Callfordmustangs.com%2Cnull/>}
is_match_patch | t
captured_patch |
-[ RECORD 5 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?audi-forums.com).*
subject | https://audi-forums.com/attachments/screenshot_20210207-151100_ebay-jpg.11506/
is_match_head | t
captured_head | {https://,audi-forums.com,NULL <https://%2Caudi-forums.com%2Cnull/>}
is_match_patch | t
captured_patch |
-[ RECORD 6 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?can-amforum.com).*
subject | https://can-amforum.com/attachments/resized_20201214_163325-jpeg.101395/
is_match_head | t
captured_head | {https://,can-amforum.com,NULL <https://%2Ccan-amforum.com%2Cnull/>}
is_match_patch | t
captured_patch |
-[ RECORD 7 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?contractortalk.com).*
subject | https://contractortalk.com/attachments/maryann-porch-roof-quote-12feb2021-jpg.508976/
is_match_head | t
captured_head | {https://,contractortalk.com,NULL <https://%2Ccontractortalk.com%2Cnull/>}
is_match_patch | t
captured_patch |
-[ RECORD 8 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?halloweenforum.com).*
subject | https://halloweenforum.com/attachments/dead-fred-head-before-and-after-jpg.744080/
is_match_head | t
captured_head | {https://,halloweenforum.com,NULL <https://%2Challoweenforum.com%2Cnull/>}
is_match_patch | t
captured_patch |
-[ RECORD 9 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?horseforum.com).*
subject | https://horseforum.com/attachments/dd90f089-9ae9-4521-98cd-27bda9ad38e9-jpeg.1109329/
is_match_head | t
captured_head | {https://,horseforum.com,NULL <https://%2Chorseforum.com%2Cnull/>}
is_match_patch | t
captured_patch |
-[ RECORD 10 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?passatworld.com).*
subject | https://passatworld.com/attachments/clean-passat-jpg.102337/
is_match_head | t
captured_head | {https://,passatworld.com,NULL <https://%2Cpassatworld.com%2Cnull/>}
is_match_patch | t
captured_patch |
-[ RECORD 11 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?plantedtank.net).*
subject | https://plantedtank.net/attachments/brendon-60p-jpg.1026075/
is_match_head | t
captured_head | {https://,plantedtank.net,NULL <https://%2Cplantedtank.net%2Cnull/>}
is_match_patch | t
captured_patch |
-[ RECORD 12 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?vauxhallownersnetwork.co.uk).*
subject | https://vauxhallownersnetwork.co.uk/attachments/opelnavi-jpg.96639/
is_match_head | t
captured_head | {https://,vauxhallownersnetwork.co.uk,NULL <https://%2Cvauxhallownersnetwork.co.uk%2Cnull/>}
is_match_patch | t
captured_patch |
-[ RECORD 13 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?volvov40club.com).*
subject | https://volvov40club.com/attachments/img_20210204_164157-jpg.17356/
is_match_head | t
captured_head | {https://,volvov40club.com,NULL <https://%2Cvolvov40club.com%2Cnull/>}
is_match_patch | t
captured_patch |
-[ RECORD 14 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?vwidtalk.com).*
subject | https://vwidtalk.com/attachments/1613139846689-png.1469/
is_match_head | t
captured_head | {https://,vwidtalk.com,NULL <https://%2Cvwidtalk.com%2Cnull/>}
is_match_patch | t
captured_patch |
-[ RECORD 15 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?yellowbullet.com).*
subject | https://yellowbullet.com/attachments/20210211_133934-jpg.204604/
is_match_head | t
captured_head | {https://,yellowbullet.com,NULL <https://%2Cyellowbullet.com%2Cnull/>}
is_match_patch | t
captured_patch |
-[ RECORD 16 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^[^\?]*)?(\?[^#]*)?(#.*$)?
subject | https://www.disneyonice.com/oneIdResponder.html
is_match_head | t
captured_head | {https://www.disneyonice.com/oneIdResponder.html,NULL,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 17 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^[a-zA-Z0-9\/_-]+)*(\.[a-zA-Z]+)?
subject | /
is_match_head | t
captured_head | {/,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 18 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^[a-zA-Z0-9\/_-]+)*(\.[a-zA-Z]+)?
subject | /en.html
is_match_head | t
captured_head | {/en,.html}
is_match_patch | t
captured_patch |
-[ RECORD 19 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^https?:\/\/)?(((\[[^\]]+\])|([^:\/\?#]+))(:(\d+))?)?([^\?#]*)(.*)?
subject | https://e.echatsoft.com/mychat/visitor
is_match_head | t
captured_head | {https://,e.echatsoft.com,e.echatsoft.com,NULL,e.echatsoft.com,NULL,NULL,/mychat/visitor <https://%2Ce.echatsoft.com%2Ce.echatsoft.com%2Cnull%2Ce.echatsoft.com%2Cnull%2Cnull%2C/mychat/visitor>,""}
is_match_patch | t
captured_patch | {NULL,https,https,NULL,https,NULL,NULL,://e.echatsoft.com/mychat/visitor,""}
-[ RECORD 20 ]-+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------
pattern | (^|.)41nbc.com$|(^|.)41nbc.dev$|(^|.)52.23.179.12$|(^|.)52.3.245.221$|(^|.)clipsyndicate.com$|(^|.)michaelbgiordano.com$|(^|.)syndicaster.tv$|(^|.)wdef.com$|(^|.)wdef.dev$|(^|.)wxxv.mysiteserver.net$|(^|.)wxxv25.dev$|(^|.)clipsyndicate.com$|(^|.)syndicaster.tv$
subject | wdef.com
is_match_head | t
captured_head | {NULL,NULL,NULL,NULL,NULL,NULL,NULL,"",NULL,NULL,NULL,NULL,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 21 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | ^((^\w+:|^)\/\/)?(?:www\.)?
subject | https://www.deputy.com/
is_match_head | t
captured_head | {https://,https <https://%2Chttps/>:}
is_match_patch | t
captured_patch |
-[ RECORD 22 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | ^((^\w+:|^)\/\/)?(?:www\.)?
subject | https://www.westernsydney.edu.au/
is_match_head | t
captured_head | {https://,https <https://%2Chttps/>:}
is_match_patch | t
captured_patch |
-[ RECORD 23 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | ^(https?:){0,1}\/\/|
subject | https://ui.powerreviews.com/api/
is_match_head | t
captured_head | {https:}
is_match_patch | t
captured_patch | {NULL}

The code to reproduce the results have been pushed here:
https://github.com/truthly/regexes-in-the-wild

Let me know if you want access to the dataset,
I could open up a port to my PostgreSQL so you could take a dump.

SELECT
pg_size_pretty(pg_relation_size('patterns')) AS patterns,
pg_size_pretty(pg_relation_size('subjects')) AS subjects;
patterns | subjects
----------+----------
20 MB | 568 MB
(1 row)

/Joel

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2021-02-13 17:35:45 Re: Some regular-expression performance hacking
Previous Message Andres Freund 2021-02-13 16:53:23 Re: repeated decoding of prepared transactions