Re: Another regexp performance improvement: skip useless paren-captures

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Joel Jacobson <joel(at)compiler(dot)org>
Subject: Re: Another regexp performance improvement: skip useless paren-captures
Date: 2021-08-09 23:31:41
Message-ID: 3643732.1628551901@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com> writes:
> +select regexp_split_to_array('', '(?:((?:q+))){0}(\1){0,0}?*[^]');
> +server closed the connection unexpectedly

Here's a quick draft patch for this. Basically it moves the
responsibility for clearing v->subs[] pointers into the freesubre()
recursion, so that it will happen for contained capturing parens
not only the top level.

There is a potentially interesting definitional question:
what exactly ought this regexp do?

((.)){0}\2

Because the capturing paren sets are zero-quantified, they will
never be matched to any characters, so the backref can never
have any defined referent. I suspect that study of the POSIX
spec would lead to the conclusion that this is a legal regexp
but it will never match anything. Implementing that would be
tedious though, and what's more it seems very unlikely that
the user wanted any such behavior. So I think throwing an
error is an appropriate response. The existing code will
throw such an error for

((.)){0}\1

so I guess Spencer did think about this to some extent -- he
just forgot about the possibility of nested parens.

This patch should work OK in HEAD and v14, but it will need
a bit of fooling-about for older branches I think, given that
they fill v->subs[] a little differently.

regards, tom lane

Attachment Content-Type Size
fix-zero-quantified-nested-parens.patch text/x-diff 3.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Mark Dilger 2021-08-09 23:37:18 Re: Another regexp performance improvement: skip useless paren-captures
Previous Message Peter Geoghegan 2021-08-09 23:20:05 Re: ECPG bug fix: DECALRE STATEMENT and DEALLOCATE, DESCRIBE