Skip site navigation (1) Skip section navigation (2)

Re: Regexp matching: bug or operator error?

From: Kenneth Tanzer <ktanzer(at)desc(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Regexp matching: bug or operator error?
Date: 2004-11-26 22:26:48
Message-ID: 41A7ADA8.5080205@desc.org (view raw, whole thread or download thread mbox)
Thread:
Lists: pgsql-docspgsql-general
OK.  I've been trying to get my mind around this, and think about ways 
to improve the documentation (more about that below).  I'm pretty sure 
that I can see the general concept now, and am almost convinced that it 
really does work as described.  I guess I really don't like the whole RE 
inheriting it's preference from the first preference it encounters.  It 
seems like it would be better to use a switch preceeding the whole 
expression, as is done with (?i) for case-insensitivity.  Designating 
(?g) for greedy and (?G) for non-greedy seems like it would be less 
inviting of human error.  But that's mostly a quibble, as long as it 
works consistently as described!

I've still got a few problems, though.  Per 9.6.3.5, "An RE consisting 
of two or more branches connected by the | operator prefers longest match."

Apply that to this example:

    SELECT substring('abc' FROM '.*?|.*?');

which returns a greedy 'abc'.  In this case, the whole RE is then 
supposed to be greedy (because of the "|").  So I suppose that _might_ 
be said to work as described, even though the | in this case overrides 
the expressed preference of both of its components (in which case I'm 
just griping about not liking the way this was implemented. :) )

But what about these two queries:

    SELECT substring('a' FROM 'a?|a?');

This returns a greedy 'a', similar to the example above.  But then why does

    SELECT substring('ba' FROM 'a?|a?');

return a non-greedy empty string?  Using the logic from the previous 
examples, this should do a greedy match and return 'a' as well.  If this 
isn't a bug, please explain why!

With regard to the documentation, after re-reading it many times I'd 
have to say the information is all there, but it's hard to absorb.  I 
think the main problem is that the term "preference" is used to discuss 
greedy/non-greediness, as well as the words greedy & non-greedy.  This 
makes it easier for humans to fail to connect the dots.  People are more 
likely to be familiar with greedy & non-greedy (especially those who've 
used other regex's before), whereas preference isn't as clear.  Perhaps 
the term "greediness preference" could be used in place of "preference", 
or "preference" could be dropped altogether.

In general, I think "prefers shortest match" could be replaced with "is 
non-greedy", and "prefers longest match" could be replaced with "is 
greedy".  A little bit more context might be helpful as well, as in 
example c) below.

As an example, here's a couple of different possibilities for the second 
sentence of the section:

a) If the RE could match more than one substring starting at that point, 
its choice is determined by its greediness /preference/: either the 
longest substring (greedy), or the shortest (non-greedy).

b) If the RE could match more than one substring starting at that point, 
the match can be either greedy (matching the longest substring) or 
non-greedy (matching the shortest substring).  Whether an RE is greedy 
or not is determined by the following rules...

c)  Like individual components of an RE, the entire RE can be either 
greedy (matching the longest substring) or non-greedy (matching the 
shortest substring). 

Do you think an edit along these lines would be helpful?  If so, I'd be 
willing to take a shot at re-writing that section.  Let me know.  Thanks.

Ken Tanzer




Tom Lane wrote:

>Ken Tanzer <ktanzer(at)desc(dot)org> writes:
>  
>
>>Thanks for the quick responses yesterday.  At a minimum, it seems like 
>>this behavior does not match what is described in the Postgres 
>>documentation (more detail below).
>>    
>>
>
>After looking at this more, I think that it is actually behaving as
>Spencer designed it to.  The key point is this bit from the fine print
>in section 9.6.3.5:
>
>	A branch has the same preference as the first quantified atom in it
>	which has a preference.
>
>("branch" being any regexp with no outer-level | operator)
>
>What this apparently means is that if the RE begins with a non-greedy
>quantifier, then the matching will be done in such a way that the whole
>RE matches the shortest possible string --- that is, the whole RE is
>non-greedy.  It's still possible for individual items within the RE to
>be greedy or non-greedy, but that only affects how much of the shortest
>possible total match they are allowed to eat relative to each other.
>All the examples I've looked at seem to work "properly" when seen in
>this light.
>
>I can see that this behavior could have some usefulness, and if need be
>you can always override it by writing (...){1,1} around the whole RE.
>So at this point I'm disinclined to vary from the Tcl semantics.
>
>This does leave us with a documentation problem though, because this
>behavior is surely not obvious from what it says in 9.6.3.5.  If you've
>got any thoughts about a better explanation, I'm all ears.
>
>  
>
>>Here's the actual regex we're working on--any help 
>>reformulating this would be great!
>>    
>>
>
>  
>
>>select substring('Searching for log 5376, referenced in this text'
>>            FROM
>>        '(?i)(?:.*?)logs?(?:\\s|\\n|<br>|<br />| 
>>)(?:entry|no|number|#)?(?:\\s|\\n|<br>|<br /> )?([0-9]{1,7})(.*?)');
>>    
>>
>
>I don't see that you need either the leading (?:.*?) or the trailing
>(.*?) here, and if you dropped them then the first quantifier would be
>the "s?" which is greedy so the curious case goes away.  I suppose the
>idea of adding (?:.*?) was to ensure that "log" will be matched to the
>first possible place where it could match --- but that is true anyway,
>per the first sentence of 9.6.3.5.
>
>			regards, tom lane
>  
>


In response to

Responses

pgsql-docs by date

Next:From: Peter EisentrautDate: 2004-11-26 22:34:16
Subject: Re: [PATCHES] SQL conformance related patch
Previous:From: Simon RiggsDate: 2004-11-26 19:52:28
Subject: Re: [PATCHES] SQL conformance related patch

pgsql-general by date

Next:From: Jonathan DaughertyDate: 2004-11-26 22:31:42
Subject: Bulk data insertion
Previous:From: Woodchuck BillDate: 2004-11-26 22:09:41
Subject: Re: comp.databases.postgresql.* groups and RFD

Privacy Policy | About PostgreSQL
Copyright © 1996-2018 The PostgreSQL Global Development Group