Re: regexp_matches() quantified-capturing-parentheses oddity

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Harald Fuchs <hari(dot)fuchs(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org, Julian Mehnle <julian(at)mehnle(dot)net>
Subject: Re: regexp_matches() quantified-capturing-parentheses oddity
Date: 2009-12-09 03:04:54
Message-ID: 20201.1260327894@sss.pgh.pa.us
Lists: pgsql-general

Harald Fuchs <hari(dot)fuchs(at)gmail(dot)com> writes:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>> Julian Mehnle <julian(at)mehnle(dot)net> writes:
>>> So far, so good. However, can someone please explain the following to me?
>>> wisu-dev=# SELECT regexp_matches('quux@foo@bar.zip', '([@.]|[^@.]+)+', 'g');
>>> wisu-dev=# SELECT regexp_matches('quux@foo@bar.zip', '([@.]|[^@.]+){1,2}', 'g');
>>> wisu-dev=# SELECT regexp_matches('quux@foo@bar.zip', '([@.]|[^@.]+){1,3}', 'g');

>> These might be a bug, but the behavior doesn't seem to me that it'd be
>> terribly well defined in any case. The function should be pulling the
>> match to the parenthesized subexpression, but here that subexpression
>> has got multiple matches --- which one would you expect to get?

> Perl seems to return always the last one, but the last one is never just
> 'p' - so I also think that Julian has spotted a bug.

Well, Perl is not the definition of correct regexp behavior ;-). It's
got a completely different regexp engine in it, and so you shouldn't
be surprised if a poorly-specified regexp gives different results.
(The regexp engine we use was borrowed from Tcl, not Perl. It has
some strengths and some weaknesses compared to Perl's.)

It does appear that our engine agrees with Perl's that the thing to do
with something like this is to return the last substring matching the
quantified expression. However, it appears to define that as the last
possible match, not what would be left over after removing the first N-1
matches left-to-right. It's possible to match the parenthesized
subexpression to just the trailing 'p', which is what it tries first,
and so that's what you get.

The right way to deal with this, I think, is to add constraints so that
the boundaries for the sub-matches are not ambiguous. Try adding
(?![^@.]) after the [^@.]+.
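
As a rough sketch, the constraint might be applied to the first query quoted above like this. It assumes the original input was 'quux@foo@bar.zip' and the character class was [@.] (the archive's address obfuscation appears to have rewritten both); no result is shown because none was reported in this message.

```sql
-- Sketch only: the first query from the thread with the suggested
-- negative lookahead (?![^@.]) added after [^@.]+, so that a run of
-- non-separator characters can only end where the next character is
-- '@', '.', or the end of the string.
-- Assumption: the input string and the [@.] character class are
-- reconstructed from the obfuscated archive text.
SELECT regexp_matches(
    'quux@foo@bar.zip',
    '([@.]|[^@.]+(?![^@.]))+',
    'g'
);
```

With the lookahead in place, an iteration of the group cannot stop at 'zi' because 'p' follows, so a run such as 'zip' has to be consumed by a single iteration and the sub-match boundaries are no longer ambiguous.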

regards, tom lane
