Skip site navigation (1) Skip section navigation (2)

Section 9.6.3.5, Regular Expression Matching Rules

From: Kenneth Tanzer <ktanzer(at)desc(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-docs(at)postgresql(dot)org
Subject: Section 9.6.3.5, Regular Expression Matching Rules
Date: 2004-11-27 20:56:09
Message-ID: 41A8E9E9.6010709@desc.org (view raw or flat)
Thread:
Lists: pgsql-docspgsql-general
>
>
>You're ignoring the first rule of matching: when there is more than one
>possible match, the match starting earliest in the string is chosen.
>
Doh!  It also plainly says that in the very first sentence--I can't 
blame the docs for missing that! :)

How about this revision for 9.6.3.5?  I tried to simplify without 
leaving out any key information or changing the meaning, but I might 
have gone overboard.  Thanks for your help!

Ken

------------------------------------------

9.6.3.5. Regular Expression Matching Rules

If there is more than one way an RE can match a string, the first 
(leftmost) match will be used.  If the same starting position yields 
multiple matches, the match can be either greedy (longest substring) or 
non-greedy (shortest substring).

Both the RE as a whole and each subexpression can have its own 
greediness.  Greediness is inherited from left to right, so that outer 
subexpressions take precedence over their component subexpressions.  The 
greediness of an RE is set by the first subexpression that is either 
greedy or non-greedy. 

Parentheses in and of themselves do not effect greediness (although they 
may affect the precedence of subexpressions).

Most atoms, all constraints, and the quantifiers {m} and {m}? are 
neither greedy nor non-greedy (because their match lengths are static). 

Other normal quantifiers are greedy unless the non-greedy form is 
explicitly designated.  The | operator is greedy.  (Use '{m,m}' to force 
greediness, or '{m,m}?' to force non-greediness).

Lengths are counted in the result string, not the pattern used for 
matching.  An empty string is considered longer than no match at all.

If case-independent matching is specified, the effect is to replace all 
case-sensitive characters with bracketed expressions containing both 
cases, or, if already in brackets, the case counterpart is added to the 
list.  (e.g., 'A' acts as  '[aA]', '[AB12]' as '[AaBb12]'.)  

If newline-sensitive matching is specified, . and bracket expressions 
using ^ will not match the newline character (so that matches only cross 
newlines by explicitly using a newline escape) and ^ and $ will match 
the empty string after and before a newline respectively, in addition to 
matching at beginning and end of string respectively. But the ARE 
escapes \A and \Z continue to match beginning or end of string /only/.

Specifying partial newline-sensitive matching affects . and bracket 
expressions, but not ^ and $.  Inverse partial newline-sensitive 
("weird") matching affects ^ and $, but not . and bracket expressions. 
This isn't very useful but is provided for symmetry.

------------------------------------

Tom Lane wrote:

>Kenneth Tanzer <ktanzer(at)desc(dot)org> writes:
>  
>
>>But what about these two queries:
>>    SELECT substring('a' FROM 'a?|a?');
>>This returns a greedy 'a', similar to the example above.  But then why does
>>    SELECT substring('ba' FROM 'a?|a?');
>>return a non-greedy empty string?
>>    
>>
>
>You're ignoring the first rule of matching: when there is more than one
>possible match, the match starting earliest in the string is chosen.
>The longer-or-shorter business only applies when there are multiple
>legal ways to form a match starting at the same place.  In this case
>'a?' can form a legal match at the beginning of the string (ie, match
>to no characters) and so the fact that a longer match is available later
>in the string doesn't enter into it.
>
>  
>
>>With regard to the documentation, after re-reading it many times I'd 
>>have to say the information is all there, but it's hard to absorb.
>>    
>>
>
>I'd agree.  This section was taken nearly verbatim from Henry Spencer's
>man page for the regexp package, and with all due respect to Henry,
>it's definitely written in geek reference-page-speak.  Maybe a few
>examples would help.
>
>On the other hand, I don't want to try to turn the section into a regexp
>tutorial --- there are entire books written about regexps (I quite like
>the O'Reilly one, btw).  So there's a bulk-vs-friendliness tradeoff to
>be made.
>
>  
>
>>I think the main problem is that the term "preference" is used to
>>discuss greedy/non-greediness, as well as the words greedy &
>>non-greedy.
>>    
>>
>
>Good point.  It would help to use only one term.
>
>  
>
>>As an example, here's a couple of different possibilities for the second 
>>sentence of the section:
>>    
>>
>
>I like this one:
>
>  
>
>>b) If the RE could match more than one substring starting at that point, 
>>the match can be either greedy (matching the longest substring) or 
>>non-greedy (matching the shortest substring).  Whether an RE is greedy 
>>or not is determined by the following rules...
>>    
>>
>
>Given that intro, there's no need to use the word "preference" at all.
>Or almost --- what term will you use for "RE with no preference"?
>Perhaps you can avoid the question by pointing out that greediness
>only matters for quantifiers, since unquantified REs can only match
>fixed-length strings.
>
>The point you make here:
>
>  
>
>>c)  Like individual components of an RE, the entire RE can be either 
>>greedy (matching the longest substring) or non-greedy (matching the 
>>shortest substring). 
>>    
>>
>
>is also important, but probably needs to be a completely separate
>paragraph containing its own example.
>
>  
>
>>Do you think an edit along these lines would be helpful?  If so, I'd be 
>>willing to take a shot at re-writing that section.  Let me know.  Thanks.
>>    
>>
>
>Fire away.  Please send whatever you come up with to the pgsql-docs
>list.
>
>			regards, tom lane
>
>---------------------------(end of broadcast)---------------------------
>TIP 9: the planner will ignore your desire to choose an index scan if your
>      joining column's datatypes do not match
>  
>


In response to

Responses

pgsql-docs by date

Next:From: Peter EisentrautDate: 2004-11-27 21:27:13
Subject: Re: SQL conformance related patch
Previous:From: Bruce MomjianDate: 2004-11-27 05:19:46
Subject: Re: FAQ and Windows

pgsql-general by date

Next:From: Johan WehtjeDate: 2004-11-27 21:17:38
Subject: Re: Query on exception handling in PL/pgSQL
Previous:From: Woodchuck BillDate: 2004-11-27 20:14:58
Subject: Re: Why the current setup of pgsql.* and comp.databases.postresql.general are BROKEN

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group