Re: Some regular-expression performance hacking

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Some regular-expression performance hacking
Date: 2021-02-20 09:19:04
Message-ID: 7c8faebf-a9c3-4144-88cd-7ff81a1e8763@www.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Feb 19, 2021, at 16:26, Tom Lane wrote:
>"Joel Jacobson" <joel(at)compiler(dot)org> writes:
>> On Thu, Feb 18, 2021, at 19:53, Tom Lane wrote:
>>> (Having said that, I can't help noticing that a very large fraction
>>> of those usages look like, eg, "[\w\W]". It seems to me that that's
>>> a very expensive and unwieldy way to spell ".". Am I missing
>>> something about what that does in Javascript?)
>
>> I think this is a non-POSIX hack to match any character, including newlines,
>> which are not included unless the "s" flag is set.
>
>> "foo\nbar".match(/([\w\W]+)/)[1];
>> "foo
>> bar"
>
>Oooh, that's very interesting. I guess the advantage of that over using
>the 's' flag is that you can have different behaviors at different places
>in the same regex.

I would guess the same thing.

>I was just wondering about this last night in fact, while hacking on
>the code to get it to accept \W etc in bracket expressions. I see that
>right now, our code thinks that NLSTOP mode ('n' switch, the opposite
>of 's') should cause \W \D \S to not match newline. That seems a little
>weird, not least because \S should probably be different from the other
>two, and it isn't. And now we see it'd mean that you couldn't use the 'n'
>switch to duplicate Javascript's default behavior in this area. Should we
>change it? (I wonder what Perl does.)
>
>regards, tom lane

To allow comparing PostgreSQL vs Javascript vs Perl,
I installed three helper-functions using plv8 and plperl,
and also one convenience function for PostgreSQL
to catch errors and return the error string instead:

The string used in this test is "foo!\n!bar",
which aims to detect differences in how new-lines
and non alpha-number characters are handled.

To allow PostgreSQL to be compared with Javascript and Perl,
the "n" flag is used for PostgreSQL when no flags are used for Javascript/Perl,
and no flag for PostgreSQL when the "s" flag is used for Javascript/Perl,
for the results to be comparable.

In Javascript, when a regex contains capture groups, the entire match
is always returns as the first array element.
To make it easier to visually compare the results,
the first element is removed from Javascript,
which works in this test since all regexes contain
exactly one capture group.

Here are the results:

$ psql -e -f not_alnum.sql regex

SELECT
regexp_match_pg(E'foo!\n!bar', '(.+)', 'n'),
(regexp_match_v8(E'foo!\n!bar', '(.+)', ''))[2:],
regexp_match_pl(E'foo!\n!bar', '(.+)', '')
;
regexp_match_pg | regexp_match_v8 | regexp_match_pl
-----------------+-----------------+-----------------
{foo!} | {foo!} | {foo!}
(1 row)

SELECT
regexp_match_pg(E'foo!\n!bar', '(.+)', ''),
(regexp_match_v8(E'foo!\n!bar', '(.+)', 's'))[2:],
regexp_match_pl(E'foo!\n!bar', '(.+)', 's')
;
regexp_match_pg | regexp_match_v8 | regexp_match_pl
-----------------+-----------------+-----------------
{"foo! +| {"foo! +| {"foo! +
!bar"} | !bar"} | !bar"}
(1 row)

SELECT
regexp_match_pg(E'foo!\n!bar', '([\w\W]+)', 'n'),
(regexp_match_v8(E'foo!\n!bar', '([\w\W]+)', ''))[2:],
regexp_match_pl(E'foo!\n!bar', '([\w\W]+)', '')
;
regexp_match_pg | regexp_match_v8 | regexp_match_pl
------------------------------------------------------------+-----------------+-----------------
{"invalid regular expression: invalid escape \\ sequence"} | {"foo! +| {"foo! +
| !bar"} | !bar"}
(1 row)

SELECT
regexp_match_pg(E'foo!\n!bar', '([\w\W]+)', ''),
(regexp_match_v8(E'foo!\n!bar', '([\w\W]+)', 's'))[2:],
regexp_match_pl(E'foo!\n!bar', '([\w\W]+)', 's')
;
regexp_match_pg | regexp_match_v8 | regexp_match_pl
------------------------------------------------------------+-----------------+-----------------
{"invalid regular expression: invalid escape \\ sequence"} | {"foo! +| {"foo! +
| !bar"} | !bar"}
(1 row)

SELECT
regexp_match_pg(E'foo!\n!bar', '([\w]+)', 'n'),
(regexp_match_v8(E'foo!\n!bar', '([\w]+)', ''))[2:],
regexp_match_pl(E'foo!\n!bar', '([\w]+)', '')
;
regexp_match_pg | regexp_match_v8 | regexp_match_pl
-----------------+-----------------+-----------------
{foo} | {foo} | {foo}
(1 row)

SELECT
regexp_match_pg(E'foo!\n!bar', '([\w]+)', ''),
(regexp_match_v8(E'foo!\n!bar', '([\w]+)', 's'))[2:],
regexp_match_pl(E'foo!\n!bar', '([\w]+)', 's')
;
regexp_match_pg | regexp_match_v8 | regexp_match_pl
-----------------+-----------------+-----------------
{foo} | {foo} | {foo}
(1 row)

SELECT
regexp_match_pg(E'foo!\n!bar', '([\W]+)', 'n'),
(regexp_match_v8(E'foo!\n!bar', '([\W]+)', ''))[2:],
regexp_match_pl(E'foo!\n!bar', '([\W]+)', '')
;
regexp_match_pg | regexp_match_v8 | regexp_match_pl
------------------------------------------------------------+-----------------+-----------------
{"invalid regular expression: invalid escape \\ sequence"} | {"! +| {"! +
| !"} | !"}
(1 row)

SELECT
regexp_match_pg(E'foo!\n!bar', '([\W]+)', ''),
(regexp_match_v8(E'foo!\n!bar', '([\W]+)', 's'))[2:],
regexp_match_pl(E'foo!\n!bar', '([\W]+)', 's')
;
regexp_match_pg | regexp_match_v8 | regexp_match_pl
------------------------------------------------------------+-----------------+-----------------
{"invalid regular expression: invalid escape \\ sequence"} | {"! +| {"! +
| !"} | !"}
(1 row)

SELECT
regexp_match_pg(E'foo!\n!bar', '(\w+)', 'n'),
(regexp_match_v8(E'foo!\n!bar', '(\w+)', ''))[2:],
regexp_match_pl(E'foo!\n!bar', '(\w+)', '')
;
regexp_match_pg | regexp_match_v8 | regexp_match_pl
-----------------+-----------------+-----------------
{foo} | {foo} | {foo}
(1 row)

SELECT
regexp_match_pg(E'foo!\n!bar', '(\w+)', ''),
(regexp_match_v8(E'foo!\n!bar', '(\w+)', 's'))[2:],
regexp_match_pl(E'foo!\n!bar', '(\w+)', 's')
;
regexp_match_pg | regexp_match_v8 | regexp_match_pl
-----------------+-----------------+-----------------
{foo} | {foo} | {foo}
(1 row)

SELECT
regexp_match_pg(E'foo!\n!bar', '(\W+)', 'n'),
(regexp_match_v8(E'foo!\n!bar', '(\W+)', ''))[2:],
regexp_match_pl(E'foo!\n!bar', '(\W+)', '')
;
regexp_match_pg | regexp_match_v8 | regexp_match_pl
-----------------+-----------------+-----------------
{!} | {"! +| {"! +
| !"} | !"}
(1 row)

SELECT
regexp_match_pg(E'foo!\n!bar', '(\W+)', ''),
(regexp_match_v8(E'foo!\n!bar', '(\W+)', 's'))[2:],
regexp_match_pl(E'foo!\n!bar', '(\W+)', 's')
;
regexp_match_pg | regexp_match_v8 | regexp_match_pl
-----------------+-----------------+-----------------
{"! +| {"! +| {"! +
!"} | !"} | !"}
(1 row)

/Joel

Attachment Content-Type Size
not_alnum.sql application/octet-stream 2.6 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael J. Baars 2021-02-20 09:27:20 computing dT from an interval
Previous Message Dilip Kumar 2021-02-20 09:15:09 Re: [HACKERS] Custom compression methods