Re: BUG #6381: Incorrect greediness behavior in certain regular expressions

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: code(at)phaedrusdeinus(dot)org
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #6381: Incorrect greediness behavior in certain regular expressions
Date: 2012-01-06 07:21:19
Message-ID: 3498.1325834479@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

code(at)phaedrusdeinus(dot)org writes:
> This, more complex/complete version, matches greedily, which is incorrect:

> =# select regexp_matches('http://blahblah.com/asdf',
> 'http(s?)(:|%3a)(//|%2f%2f)(.*?)(/|%2f|$)');
> regexp_matches
> --------------------------------
> {"",:,//,blahblah.com/asdf,""}

I do not believe this is a bug; the RE code appears to me to be
following its specification, as per the detailed rules in section
9.7.3.5:
http://www.postgresql.org/docs/9.1/static/functions-matching.html#POSIX-MATCHING-RULES

Specifically, the regex as a whole is considered greedy because the
first subpart with a greediness attribute is the (s?) piece, and the ?
quantifier is greedy. Therefore, the regex as a whole matches the
longest possible string, which in this case will be up to the end of the
input. It's true that the (.*?) subpart is non-greedy, but that only
affects how much of the overall match length can get assigned to that
subpart relative to other subparts, and in this case there is no
flexibility to assign it more or less of the match once the total match
length is determined.

The reason your second case "works" as you expect is that there's no way
for it to match anything beyond the last slash:

> select regexp_matches('http://blahblah.com/asdf',
> 'http(s?)(:|%3a)(//|%2f%2f)(.*?)(/|%2f)');
> regexp_matches
> --------------------------
> {"",:,//,blahblah.com,/}

There's only one possible match here, independently of whether the (.*?)
portion is greedy or not.

With this example, you could get the results you're after by using
the non-greedy ?? operator for the first subpart:

regression=# select regexp_matches('http://blahblah.com/asdf', 'http(s??)(:|%3a)(//|%2f%2f)(.*?)(/|%2f|$)');
regexp_matches
--------------------------
{"",:,//,blahblah.com,/}

although I cannot tell whether that generalizes to your real problem.

regards, tom lane

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message koizumistr 2012-01-06 14:58:18 BUG #6384: pg_types_date.h does not exist
Previous Message code 2012-01-06 00:32:17 BUG #6381: Incorrect greediness behavior in certain regular expressions