Re: Regexp matching: bug or operator error?

From: Kenneth Tanzer <ktanzer(at)desc(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Regexp matching: bug or operator error?
Date: 2004-11-26 22:26:48
Message-ID: 41A7ADA8.5080205@desc.org
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-docs pgsql-general

OK. I've been trying to get my mind around this, and think about ways
to improve the documentation (more about that below). I'm pretty sure
that I can see the general concept now, and am almost convinced that it
really does work as described. I guess I really don't like the whole RE
inheriting it's preference from the first preference it encounters. It
seems like it would be better to use a switch preceeding the whole
expression, as is done with (?i) for case-insensitivity. Designating
(?g) for greedy and (?G) for non-greedy seems like it would be less
inviting of human error. But that's mostly a quibble, as long as it
works consistently as described!

I've still got a few problems, though. Per 9.6.3.5, "An RE consisting
of two or more branches connected by the | operator prefers longest match."

Apply that to this example:

SELECT substring('abc' FROM '.*?|.*?');

which returns a greedy 'abc'. In this case, the whole RE is then
supposed to be greedy (because of the "|"). So I suppose that _might_
be said to work as described, even though the | in this case overrides
the expressed preference of both of its components (in which case I'm
just griping about not liking the way this was implemented. :) )

But what about these two queries:

SELECT substring('a' FROM 'a?|a?');

This returns a greedy 'a', similar to the example above. But then why does

SELECT substring('ba' FROM 'a?|a?');

return a non-greedy empty string? Using the logic from the previous
examples, this should do a greedy match and return 'a' as well. If this
isn't a bug, please explain why!

With regard to the documentation, after re-reading it many times I'd
have to say the information is all there, but it's hard to absorb. I
think the main problem is that the term "preference" is used to discuss
greedy/non-greediness, as well as the words greedy & non-greedy. This
makes it easier for humans to fail to connect the dots. People are more
likely to be familiar with greedy & non-greedy (especially those who've
used other regex's before), whereas preference isn't as clear. Perhaps
the term "greediness preference" could be used in place of "preference",
or "preference" could be dropped altogether.

In general, I think "prefers shortest match" could be replaced with "is
non-greedy", and "prefers longest match" could be replaced with "is
greedy". A little bit more context might be helpful as well, as in
example c) below.

As an example, here's a couple of different possibilities for the second
sentence of the section:

a) If the RE could match more than one substring starting at that point,
its choice is determined by its greediness /preference/: either the
longest substring (greedy), or the shortest (non-greedy).

b) If the RE could match more than one substring starting at that point,
the match can be either greedy (matching the longest substring) or
non-greedy (matching the shortest substring). Whether an RE is greedy
or not is determined by the following rules...

c) Like individual components of an RE, the entire RE can be either
greedy (matching the longest substring) or non-greedy (matching the
shortest substring).

Do you think an edit along these lines would be helpful? If so, I'd be
willing to take a shot at re-writing that section. Let me know. Thanks.

Ken Tanzer

Tom Lane wrote:

>Ken Tanzer <ktanzer(at)desc(dot)org> writes:
>
>
>>Thanks for the quick responses yesterday. At a minimum, it seems like
>>this behavior does not match what is described in the Postgres
>>documentation (more detail below).
>>
>>
>
>After looking at this more, I think that it is actually behaving as
>Spencer designed it to. The key point is this bit from the fine print
>in section 9.6.3.5:
>
> A branch has the same preference as the first quantified atom in it
> which has a preference.
>
>("branch" being any regexp with no outer-level | operator)
>
>What this apparently means is that if the RE begins with a non-greedy
>quantifier, then the matching will be done in such a way that the whole
>RE matches the shortest possible string --- that is, the whole RE is
>non-greedy. It's still possible for individual items within the RE to
>be greedy or non-greedy, but that only affects how much of the shortest
>possible total match they are allowed to eat relative to each other.
>All the examples I've looked at seem to work "properly" when seen in
>this light.
>
>I can see that this behavior could have some usefulness, and if need be
>you can always override it by writing (...){1,1} around the whole RE.
>So at this point I'm disinclined to vary from the Tcl semantics.
>
>This does leave us with a documentation problem though, because this
>behavior is surely not obvious from what it says in 9.6.3.5. If you've
>got any thoughts about a better explanation, I'm all ears.
>
>
>
>>Here's the actual regex we're working on--any help
>>reformulating this would be great!
>>
>>
>
>
>
>>select substring('Searching for log 5376, referenced in this text'
>> FROM
>> '(?i)(?:.*?)logs?(?:\\s|\\n|<br>|<br />|
>>)(?:entry|no|number|#)?(?:\\s|\\n|<br>|<br /> )?([0-9]{1,7})(.*?)');
>>
>>
>
>I don't see that you need either the leading (?:.*?) or the trailing
>(.*?) here, and if you dropped them then the first quantifier would be
>the "s?" which is greedy so the curious case goes away. I suppose the
>idea of adding (?:.*?) was to ensure that "log" will be matched to the
>first possible place where it could match --- but that is true anyway,
>per the first sentence of 9.6.3.5.
>
> regards, tom lane
>
>

In response to

Responses

Browse pgsql-docs by date

  From Date Subject
Next Message Peter Eisentraut 2004-11-26 22:34:16 Re: [PATCHES] SQL conformance related patch
Previous Message Simon Riggs 2004-11-26 19:52:28 Re: [PATCHES] SQL conformance related patch

Browse pgsql-general by date

  From Date Subject
Next Message Jonathan Daugherty 2004-11-26 22:31:42 Bulk data insertion
Previous Message Woodchuck Bill 2004-11-26 22:09:41 Re: comp.databases.postgresql.* groups and RFD