Re: Regexp matching: bug or operator error?

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Ken Tanzer <ktanzer(at)desc(dot)org>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Regexp matching: bug or operator error?
Date: 2004-11-25 00:02:47
Message-ID: 22644.1101340967@sss.pgh.pa.us
Lists: pgsql-docs pgsql-general

Ken Tanzer <ktanzer(at)desc(dot)org> writes:
> Thanks for the quick responses yesterday. At a minimum, it seems like
> this behavior does not match what is described in the Postgres
> documentation (more detail below).

After looking at this more, I think that it is actually behaving as
Spencer designed it to. The key point is this bit from the fine print
in section 9.6.3.5:

    A branch has the same preference as the first quantified atom in it
    which has a preference.

("branch" being any regexp with no outer-level | operator)

What this apparently means is that if the RE begins with a non-greedy
quantifier, then the matching will be done in such a way that the whole
RE matches the shortest possible string --- that is, the whole RE is
non-greedy. It's still possible for individual items within the RE to
be greedy or non-greedy, but that only affects how much of the shortest
possible total match they are allowed to eat relative to each other.
All the examples I've looked at seem to work "properly" when seen in
this light.
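
A minimal sketch of the rule, assuming a PostgreSQL session using the
default advanced-RE flavor. The test string 'XY1234Z' is only an
illustration and is not taken from the report above:

```sql
-- First quantified atom is Y* (greedy), so the whole RE is greedy:
-- it matches the longest string starting at the Y, i.e. Y123.
SELECT substring('XY1234Z' from 'Y*([0-9]{1,3})');   -- 123

-- First quantified atom is Y*? (non-greedy), so the whole RE is
-- non-greedy: it matches the shortest string starting at the Y, i.e. Y1.
SELECT substring('XY1234Z' from 'Y*?([0-9]{1,3})');  -- 1
```

In both queries [0-9]{1,3} is itself greedy; only the preference of the
RE as a whole differs.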

I can see that this behavior could have some usefulness, and if need be
you can always override it by writing (...){1,1} around the whole RE.
So at this point I'm disinclined to vary from the Tcl semantics.
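
The override can be sketched as follows. This assumes a server that has
regexp_match() (PostgreSQL 10 or later, so not the release discussed
here), and the input string is made up for illustration:

```sql
-- Leading (.*?) makes the whole RE non-greedy, so the total match is
-- as short as possible: the digits atom gets one digit, the tail nothing.
SELECT regexp_match('abc01234xyz', '(.*?)([0-9]+)(.*)');
-- {abc,0,""}

-- Wrapping the RE in (?:...){1,1} makes the outer quantifier the first
-- one with a preference; {1,1} is greedy, so the whole RE is greedy,
-- while (.*?) still takes as little as it can inside that match.
SELECT regexp_match('abc01234xyz', '(?:(.*?)([0-9]+)(.*)){1,1}');
-- {abc,01234,xyz}
```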

This does leave us with a documentation problem though, because this
behavior is surely not obvious from what it says in 9.6.3.5. If you've
got any thoughts about a better explanation, I'm all ears.

> Here's the actual regex we're working on--any help
> reformulating this would be great!

> select substring('Searching for log 5376, referenced in this text'
> FROM
> '(?i)(?:.*?)logs?(?:\\s|\\n|<br>|<br />|
> )(?:entry|no|number|#)?(?:\\s|\\n|<br>|<br /> )?([0-9]{1,7})(.*?)');

I don't see that you need either the leading (?:.*?) or the trailing
(.*?) here, and if you dropped them then the first quantifier would be
the "s?" which is greedy so the curious case goes away. I suppose the
idea of adding (?:.*?) was to ensure that "log" will be matched to the
first possible place where it could match --- but that is true anyway,
per the first sentence of 9.6.3.5.
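
A trimmed-down sketch of that suggestion, not the full original pattern:
the line-break alternatives are cut down to whitespace and <br>, and
[[:space:]] stands in for \\s so that the result does not depend on
string-literal backslash handling:

```sql
-- No leading (?:.*?) and no trailing (.*?): the first quantified atom
-- is now "s?", which is greedy, so the whole RE is greedy and
-- [0-9]{1,7} takes as many digits as it can.
SELECT substring('Searching for log 5376, referenced in this text'
                 from '(?i)logs?(?:[[:space:]]|<br>)(?:entry|no|number|#)?[[:space:]]?([0-9]{1,7})');
-- 5376
```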

regards, tom lane
