Re: Another regexp performance improvement: skip useless paren-captures

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Joel Jacobson <joel(at)compiler(dot)org>
Subject: Re: Another regexp performance improvement: skip useless paren-captures
Date: 2021-08-10 01:11:14
Message-ID: 3730031.1628557874@sss.pgh.pa.us
Lists: pgsql-hackers

Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com> writes:
>> On Aug 9, 2021, at 4:31 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> There is a potentially interesting definitional question:
>> what exactly ought this regexp do?
>> ((.)){0}\2
>> Because the capturing paren sets are zero-quantified, they will
>> never be matched to any characters, so the backref can never
>> have any defined referent.

> Perl regular expressions are not POSIX, but if there is a principled reason POSIX should differ from perl on this, we should be clear what that is:

> if ('foo' =~ m/((.)(??{ die; })){0}(..)/)
> {
>     print "captured 1 $1\n" if defined $1;
>     print "captured 2 $2\n" if defined $2;
>     print "captured 3 $3\n" if defined $3;
>     print "captured 4 $4\n" if defined $4;
>     print "match = $match\n" if defined $match;
> }

Hm. I'm not sure that this example proves anything about Perl's handling
of the situation, since you didn't use a backref. I tried both

if ('foo' =~ m/((.)){0}\1/)

if ('foo' =~ m/((.)){0}\2/)

and while neither throws an error, they don't succeed either.
So AFAICS Perl is acting in the way I'm attributing to POSIX.
But maybe we should actually read POSIX ...
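
To repeat the check above, here is a minimal standalone sketch. The probe() helper, the eval guard, and the output wording are illustrative scaffolding and were not part of the original test; only the subject string and the two patterns come from this message. Going by the results reported above, both patterns should come back as "no match" with no error, but that depends on the Perl version in use.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): try a pattern against a
# subject and report "matched", "no match", or "error".
sub probe {
    my ($subject, $pat) = @_;
    # eval traps a regex compile error, should a given Perl raise one.
    my $matched = eval { $subject =~ /$pat/ ? 1 : 0 };
    return "error" unless defined $matched;
    return $matched ? "matched" : "no match";
}

# Backrefs to capture groups nested inside a zero-quantified group,
# so the referenced group can never be set.
for my $pat ('((.)){0}\1', '((.)){0}\2') {
    print "$pat: ", probe('foo', $pat), "\n";
}
```

The helper takes the subject as a parameter so that an ordinary backref such as '(.)\1' can be run as a control alongside the zero-quantified cases.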

>> ... I guess Spencer did think about this to some extent -- he
>> just forgot about the possibility of nested parens.

> Ugg. That means our code throws an error where perl does not, pretty
> well negating my point above. If we're already throwing an error for
> this type of thing, I agree we should be consistent about it. My
> personal preference would have been to do the same thing as perl, but it
> seems that ship has already sailed.

Removing an error case is usually an easier sell than adding one.
However, the fact that the simplest case (viz, '(.){0}\1') has always
thrown an error and nobody has complained in twenty-ish years suggests
that nobody much cares.

regards, tom lane
