Re: Regular Expression INTRO [long] [OT]

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: will trillich <will(at)serensoft(dot)com>
Cc: Justin Clift <justin(at)postgresql(dot)org>, pgsql <pgsql-general(at)postgresql(dot)org>
Subject: Re: Regular Expression INTRO [long] [OT]
Date: 2002-08-26 00:35:29
Message-ID: 200208260035.g7Q0ZTr25111@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general


Justin, have added this to techdocs? I have a followup email here with
some clarifications.

---------------------------------------------------------------------------

will trillich wrote:
> wow -- this was nearly a year ago and i still haven't
> done it!
>
> Justin Clift wrote:
> > Would you mind taking a few moments and writing a guide on using Regex
> > functions in PostgreSQL?
> >
> > This is to put on the techdocs.postgresql.org website so everyone in the
> > PostgreSQL community will have access to it.
>
> i'm posting to pgsql-general so the community
> can help find holes and plug them...
>
> here goes --
>
>
>
>
>
>
> Finding data in PostgreSQL using LIKE or ~ (REGULAR EXPRESSIONS):
>
>
> LIKE
> ====
>
> You're probably familiar with the WHERE clause
> of the SELECT statement --
>
> SELECT somefield FROM table WHERE somefield = somevalue;
>
> For example, to show all customers within the
> post code "60201" you could try
>
> SELECT * FROM cust WHERE zip = '60201';
>
> But if you also have nine-digit zips stored in the zip
> field (in the U.S.A. we have standard five-digit postal
> codes and also a four-digit extension for further
> geographic resolution) then this search will OMIT
> customers having more-specific zips such as '60201-9876'
> or '60201-1234'.
>
>
> The Percent "%"
> ---------------
>
> Try using LIKE instead of = (EQUALS):
>
> SELECT * FROM cust WHERE zip LIKE '60201%';
>
> With LIKE, your search values can use the percent (%)
> to tell PostgreSQL that anything can occupy that spot
> -- one character, twelve symbols or zero digits -- and
> still satisfy the search.
>
> On Unix or Linux, this is basically the same as the
> ASTERISK (*) at the command line, when dealing with
> file names:
>
> # list all files whose names begin with '.bash'
> ls .bash*
>
> # list all files containing 'out' anywhere in the name
> ls *out*
>
> # list all file names ending with '.pl'
> ls *.pl
>
> # list file starting with 'proj', ending with '.c'
> ls proj*.c
>
> With PostgreSQL using the LIKE operator, use the
> percent, instead:
>
> -- list all customers within the 47610 postal code
> SELECT * FROM cust WHERE zip LIKE '47610%';
>
> -- display customers who have 'Corp' in their names
> SELECT * FROM cust WHERE name LIKE '%Corp%';
>
> -- show customers whose names end in 'LLC'
> SELECT * FROM cust WHERE name LIKE '%LLC';
>
> -- documents beginning with 'We', ending with 'rica'
> SELECT * FROM doc WHERE contents LIKE 'We%rica';
>
> Wherever the '%' appears (using the LIKE operator)
> Postgres allows anything at all to match -- from a
> lengthy string of text, to one single character, to
> a zero-length string -- i.e. nothing at all.
>
> ...ILIKE 'A%Z'
> -- matches 'AZ' and 'A to Z' and 'ABC2468XYZ' and 'A(at)$*Z'
>
>
> The Underscore "_"
> ------------------
>
> For the LIKE operator, the UNDERSCORE (_) takes on the same
> meaning as the question mark does to shell operations for
> Unix and Linux file names:
>
> # list files starting with dot, followed by at least two chars
> ls .??*
>
> The underscore matches just one character exactly:
>
> SELECT * FROM atable WHERE afield LIKE '_';
> -- shows records where afield is exactly one character
> -- omitting ones where it's blank or has two (or more)
> -- characters in it)
>
> You can restrict your searches to finding fields
> of certain lengths this way:
>
> SELECT * FROM cust WHERE zip LIKE '_____' OR zip LIKE '_________';
> -- display all five- or nine-character zip codes
>
> Combine it with percent to find fields over a certain length:
>
> SELECT * FROM atable WHERE afield LIKE '___%';
> -- display records where afield has three or more characters
>
> Or you can accomodate some language idiosyncracies, as
> well:
>
> SELECT * FROM activity WHERE venue LIKE 'Theat__';
> -- find u.k. 'theatre' and u.s. 'theater'
>
>
> ILIKE
> =====
>
> But what if a customer's name is all capitals, as in
> 'THE ACME CORPORATION'? The "...LIKE '%Corp%'" won't
> find it, because uppercase 'O' is not identical to
> lowercase 'o'!
>
> Finding text independent of uppercase/lowercase
> is something we often need to do. And so we have
> the ILIKE operator.
>
> Use ILIKE instead of LIKE when your search should
> ignore case -- the "I" stands for case-[I]nsensitive:
>
> -- show all customers with 'corp' in the name
> SELECT * FROM cust WHERE name ILIKE '%Corp%';
> -- ignoring whether the field is upper- or lowercase
>
> That'll find 'Bubba Gump Shrimp Corp' and 'ACME CORP'
> and even 'Amalgamated Switch and Relay corporation'
> because anything can ('%') precede or ('%') follow the
> string 'Corp' which can be uppercase or lowercase.
>
> But there's even more power in "regular expressions"!
> See below...
>
>
> Notes
> -----
>
> Always include some real data to search for!
>
> -- show everything, doing a lot of unnecessary work:
> SELECT * FROM atable WHERE afield LIKE '%';
>
> It would be rather silly to make PostgreSQL compare
> every record to see if it matched 'anything goes,
> in "afield"'. Just say no.
>
> Convert your spoken-language request to a LIKE
> clause like this:
>
> "afield ENDS WITH avalue"
> reword => "afield starts with anything, ending with avalue"
> sql => "... WHERE afield LIKE '%avalue'"
>
> "afield STARTS WITH avalue"
> reword => "afield starts with avalue, ending with anything"
> sql => "... WHERE afield LIKE 'avalue%'"
>
> See how that works?
>
> "afield CONTAINS avalue"
> sql => "... WHERE afield LIKE '%avalue%'"
>
> If you're looking for something at the beginning of
> a field, that's where your data goes, and you END with
> the percent. If you're looking for something at the end
> of a field, put your data there and allow anything at
> the front by putting the percent there.
>
> ALSO -- if your field is indexed, you'll defeat the index
> unless you anchor your search to the beginning of the field.
>
> -- find fields starting with 'something'
> ... afield LIKE 'something%'; -- uses afield's index
>
> The example below isn't anchored at the start of the field,
> so it does no good to use the index:
>
> -- find fields containing 'pooh'
> ... afield LIKE '%pooh%'; -- can't use afield's index
>
> Afield must contain 'pooh' BUT it could be anywhere in
> the field. The index would be useless.
>
>
> LIMITATION
> ==========
>
> Okay. Let's say you have a full-name field that has
> first, middle, and last name all in the one field.
> Sometimes you'll have a middle name, sometimes
> you won't. Sometimes the first name will be spelled
> out, other times it'll be shortened or even abbreviated
> to just the first initial.
>
> Here's how you'd search that field for "Abraham Lincoln"
> using the LIKE operator:
>
> SELECT * FROM peron WHERE fullname LIKE 'A%Lincoln';
>
> It'll find 'Abraham Lincoln' and 'A. Lincoln' and
> 'Abe Lincoln' because anything can go between the
> initial 'A' and the ending 'Lincoln'.
>
> But -- it'll also match 'Andrew M. Lincoln' and 'Abner
> Sasquatch Lincoln' as well... and this is why some bright
> souls created regular expressions!
>
>
> REGULAR EXPRESSIONS
> ===================
>
> A regular expression is a pattern to search for.
> The structure of the pattern has to match a
> rigid set of rules so that PostgreSQL will know
> what you're trying to search for.
>
> This introduction will barely scratch the surface,
> but hopefully it'll get you started--
>
> If you've ever tinkered with perl, you've probably
> encountered regular expressions:
>
> perl -ne 'print if /[A-Za-z]/;' somefile
>
> Here, the "[A-Za-z]" is a regular expression.
> It matches any line containing uppercase (A-Z)
> or lowercase (a-z) letters. In this example, the
> lines will come from the file "somefile" and
> lines that match will be printed out.
>
> Well, PostgreSQL can do that, too!
>
>
> Example
> -------
>
> SELECT * FROM person
> WHERE fullname ~ 'A(be|braham|\\.) *Lincoln';
>
> The REGULAR EXPRESSION operator is the TILDE (~).
> The value you're searching for then must be a valid
> regular expression. In the above example, we're asking
> PostgreSQL to display all rows containing
>
> 'A'
>
> followed by EITHER
>
> 'be'
> OR
> 'braham'
> OR
> '\\.' -- a dot
>
> which is then followed by
>
> ' ' -- a space
> '*' -- zero or more times
>
> followed by
>
> 'Lincoln'
>
> and that's all. Whew!
>
> As you can guess, these will all successfully match:
>
> 'A. Lincoln'
> 'AbeLincoln' -- without any space at all
> 'Abraham Lincoln' -- with lots of spaces
>
> and these won't match:
>
> 'A Lincoln' -- no dot, no 'braham', no 'be'
> 'Abe Gump Lincoln' -- nothing will match 'Gump'
>
>
> Quoting "\\"
> ------------
>
> So what's with the BACKSLASH in front of the dot? Well,
> just as LIKE has the UNDERSCORE (_) to denote "any single
> character", REGULAR EXPRESSIONS use the DOT (.) for
> that very same purpose. So we have to "escape" the
> dot to alter its normal interpretation, using the.
>
> Note that string literals in PostgreSQL already use the
> backslash, so you'll have to double any backslash
> you use:
>
> psql=# SELECT * FROM person
> psql-# WHERE fullname ~ 'A(be|braham|\\.) *Lincoln'
> psql-# -- a second backslash added
> psql-# ;
>
>
> Grouping "()"
> -------------
>
> How about the PARENTHESES () and the BAR |, hmm? Well,
> that's how you can group permitted alternatives:
>
> ... ~ 'A(be|braham|\\.)'
>
> That says that 'A' can be followed by any of the three
> sub-expressions within the parentheses, namely 'be' or
> 'braham' or <dot>.
>
> These all produce the same matches:
>
> ... ~ 'A(be|braham|\\.)'
> ... ~ 'Abe|Abraham|A\\.'
> ... ~ 'A(b(e|raham)|\\.)'
>
> Let's dissect this last one -- it specifies that:
>
> 'A'
>
> can be followed by either
>
> 'b' followed by some more stuff
> or
> <dot>
>
> nestling into the details following the 'b' -- if
> 'A' is followed by 'b', then the 'b' must be followed
> by either
>
> 'e' -- which matches 'Abe'
> or
> 'raham' -- which matches 'Abraham'
>
> Do you see how powerful this can be?
>
>
> Character Classes "[]"
> ----------------------
>
> So you're trying to find "McAndrews" -- but maybe it's
> spelled with a lower-case "a" as in "Mcandrews", instead!
>
> SELECT * FROM person
> WHERE fullname ~ 'Mc[Aa]ndrews';
>
> Using the BRACKETS tells the regular expression
> parser that you're allowing any one of a whole class
> of characters in that spot.
>
> '[Aa]'
>
> It specifies that you're looking for either an uppercase
> 'A' or a lowercase 'a' in that spot.
>
> Here's how you can find fields containing vowels:
>
> ... ~ '[AEIOUaeiou]'
>
> Search for fields containing lowercase letters:
>
> ... ~ '[abcdefghijklmnopqrstuvwxyz]'
>
> Of course, there's a shortcut for specifying character
> classes that cover a whole range:
>
> ... ~ '[a-z]' -- also matches any lowercase letter
>
> Display fields that contain digits:
>
> ... ~ '[0-9]'
>
> Here's an example using a standard U.S. phone pattern
> (neglecting the area code for clarity and space):
>
> SELECT * FROM cust
> WHERE descr ~ '[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]';
>
> That will show customers whose "descr" field contains a
> pattern of digits that looks like a U.S. phone number:
>
> '[0-9][0-9][0-9]' -- three digits
>
> followed by
>
> '-' -- a hyphen
>
> followed by
>
> '[0-9][0-9][0-9][0-9]' -- four digits
>
> As you can see, the hyphen doesn't do anything special
> for a regular expression unless it's inside the square
> brackets of a character class -- in which case it
> means "anything between".
>
> What if you want to allow a hyphen within a character class?
> Simply make it the first character inside the brackets:
>
> '[-.,_]'
>
> That class will match any hyphen, dot, comma or underscore.
>
>
> Bounds "{}"
> -----------
>
> Rather than having to specify each of those [0-9]
> monstrosities every time, let's abbreviate. Instead
> of:
>
> '[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]'
>
> we can just say
>
> '[0-9]{3}-[0-9]{4}'.
>
> The BOUNDS specification, inside braces, comes immediately
> after whatever you want a certain number of. That is,
> first you specify what you're looking for, then you specify
> how many you need. (The default is {1,1} one, as you can tell:
> one 'A' followed by one 'b' followed by...)
>
> Inside the BRACES that specify your "bounds" you can
> say you want exactly so many, as we did above (3 and
> then 4) or you can give a low-to-high pair:
>
> ... ~ 'Z{3}' -- need three Z's
> ... ~ '@{5,}' -- five or more @'s
> ... ~ 'Q{2,5}' -- at least two, at most five, Q's
> ... ~ '(This|That){0,3}'
>
> The last one allows for 'This' or 'That', anywhere from
> zero to three times. So 'ThisThatThis' and '' will match!
> Convenient, hmm?
>
> And for very common cases, there are handy abbreviations:
>
> 'x{1,1}' 'x' one (the default)
> 'r{0,1}' 'r?' zero or one, i.e. 'x' is optional
> 'B{0,}' 'B*' zero or more
> 'z{1,}' 'z+' one or more
>
> These three abbreviations will pop up often:
>
> -- allow 'Ms' and 'Mrs' (the 'r' is optional):
> ... title ~ '(Mr|Mr?s|Dr)'
>
> -- maybe there's a space, maybe there isn't:
> ... lname ~ 'Mac ?Affee'
>
> -- don't care how many trailing spaces:
> ... lname ~ 'Smith *'
>
> -- find records with two or more even digits:
> ... afield ~ '[02468]{2,}'
> or
> ... afield ~ '[02468][02468]+'
>
> That last one breaks down to
>
> '[02468]' -- an even digit
>
> followed by
>
> '[02468]' -- an even digit
> '+' -- one or more times
>
> Remember: the plus means '{1,}' -- i.e. that the preceeding
> item must match one or more times.
>
> Getting back to our phone number pattern... this is
> the same request as above, but it's easier to
> see what's going on, using the {bounds} feature:
>
> SELECT * FROM cust
> WHERE descr ~ '[0-9]{3}-[0-9]{4}';
> -- three digits, hyphen, four digits
>
>
> Excluding Characters "[^]"
> --------------------------
>
> Note that the above pattern will also match 00000-0000
> because the last three digits of the five in the left
> chunk, with the four digits of the right chunk, fit
> perfectly into the constraints of what we asked for.
> And this ain't no phone number.
>
> Here's an approach to fixing that snag:
>
> SELECT * FROM cust
> WHERE descr ~ '[^0-9][0-9]{3}-[0-9]{4}';
>
> Note that we've prefixed the previous pattern with
>
> '[^0-9]'
>
> because, within the square brackets of a character class,
> the CARAT (^) means "anything EXCEPT..."
>
> So now, 00000-0000 will NOT match; this is what we're
> after.
>
> But there's a new problem: if the phone number is the
> very first thing in the "descr" field it would never
> match because we're demanding that there be SOMETHING
> (besides a digit) before the phone number. What to do?
>
>
> At The Very Beginning "^"
> -------------------------
>
> You can match "beginning-of-field" with the CARAT (^).
> (When it's the first thing inside square brackets, it
> negates the character class; outside brackets, a carat
> means "beginning-of-text". Got that? It's important!)
>
> SELECT * FROM cust
> WHERE descr ~ '(^|[^0-9])[0-9]{3}-[0-9]{4}';
>
> Here's what this pattern looks for:
>
> Either
> '^' -- at the very beginning of text
> or
> '[^0-9]' -- find something that's a non-digit
>
> followed by
>
> '[0-9]{3}' -- three digits
>
> and then a
>
> '-' -- hyphen
>
> and finally
>
> '[0-9]{4}' -- four digits
>
> and we've solved the 'at-beginning-of-field' problem.
>
>
> At The Very End "$"
> -------------------
>
> Still, we could inadvertently match something like
> 000-000000000 (probably an id number for a replacement
> part) which we don't want to do.
>
> Note how this is the same problem, for the end
> of the field, that we had earlier, with the beginning
> of the field? The solution is similar:
>
> SELECT * FROM cust
> WHERE descr ~ '(^|[^0-9])[0-9]{3}-[0-9]{4}($|[^0-9])';
>
> The DOLLAR SIGN ($) represents 'end-of-text' just as
> carat, outside of square brackets, means 'beginning-of-
> text'. The above pattern is the same as before, except
> we appended
>
> '($|[^0-9])'
>
> to the end, which means we're looking for
>
> either
>
> '$' -- at the very end-of-text
> or
> '[^0-9]' -- something that's not a digit
>
> after our previous pattern.
>
> And, finally, we have a comprehensive phone pattern --
> it'll match ###-#### while excluding other numeric
> arrangements -- in a text field.
>
>
> Anchoring
> ---------
>
> Did you notice that regular expressions aren't "anchored",
> as the LIKE expressions are? To specify that you're looking
> only at the beginning of a field, you need to use '^' and
> to include the end of a field, you must use '$'.
>
> It's a good thing regular expressions aren't anchored
> automatically -- we wouldn't have been able to specify
> our phone number pattern above! Things like '(^|[^0-9])'
> are possible because YOU get to specify when and how
> beginning-of-text is required. Cool, eh?
>
> And, just as in the LIKE operator, unless regular expressions
> ARE anchored at the beginning of a field, you'll defeat any
> index you have for that field. Indexes help alphabetize by
> comparing the beginning of your fields to each other; unless
> you're looking at the beginning of the field, your index can't
> be used.
>
> Here are some comparisons between operators LIKE and ~ :
>
> -- list all customers within the 47610 postal code:
> SELECT * FROM cust WHERE zip LIKE '47610%';
> -- zip begins with '47610' then has anything
> SELECT * FROM cust WHERE zip ~ '^47610';
> -- zip begins with '47610'
>
> -- display customers who have 'Corp' in their names
> SELECT * FROM cust WHERE name LIKE '%Corp%';
> -- name contains 'Corp' with anything before and after
> SELECT * FROM cust WHERE name ~ 'Corp';
> -- name contains 'Corp'
>
> -- show customers whose names end in 'LLC'
> SELECT * FROM cust WHERE name LIKE '%LLC';
> -- name can have anything, with 'LLC' at the end
> SELECT * FROM cust WHERE name ~ 'LLC$';
> -- name must have 'LLC' at the end
>
> -- documents beginning with 'We', ending with 'rica'
> SELECT * FROM doc WHERE contents LIKE 'We%rica';
> -- starts with 'We', has anything, ending with 'rica'
> SELECT * FROM doc WHERE contents ~ '^We.*rica$';
> -- start with 'We', zero or more chars, end with 'rica'
>
> Remember that in regular expressions, DOT means 'any character'.
> Thus '.*' means 'any character, zero times or more' which, in
> English, means "anything can go here, including nothing at all".
>
> There are ways to anchor your searches to word boundaries,
> as well -- not just beginning-of-field and end-of-field. See
> your documentation for details.
>
>
> Case-Insensitive "~*"
> ---------------------
>
> If you're not worried about differentiating between
> uppercase and lowercase in your regular expressions,
> you could go full boar and try
>
> SELECT * FROM cust
> WHERE lname ~ '[Oo][Cc][Tt][Aa][Vv][Ii][Aa][Nn]';
>
> Fortunately there's a quickie to make it easier for you --
> where you've been using the tilde (~) as your operator, use
> tilde-star (~*) instead:
>
> SELECT * FROM cust
> WHERE lname ~* 'Octavian';
>
> ILIKE is case-insensitive (where LIKE is case-specific) just
> as ~* is case-insensitive for regular expressions (where ~ is
> case-specific). Very handy!
>
>
> Example
> -------
>
> Email addresses can look like any of these:
>
> me(at)my(dot)net
> someone-unimportant9237(at)this(dot)little(dot)org
> first(dot)last(dot)title(at)obscure(dot)sub-net(dot)biggie(dot)com
> _weird_(at)somewhere(dot)out_there(dot)net
>
> A reasonably-functional email address pattern might be
> something like this:
>
> '[a-z0-9_]+([\\-\\.][a-z0-9_]+)*(at)[a-z0-9_]+([\\-\\.][a-z0-9_]+)+'
>
> That breaks down to (using case-indifferent via ~*, of course):
>
> '[a-z0-9_]'
>
> any alphanumeric (including underscore) character
>
> '+'
>
> one or more times, followed by
>
> '([\\-\\.][a-z0-9_]+)'
>
> email stuff [1],
>
> '*'
>
> zero or more times, followed by
>
> '@'
>
> which is then followed by
>
> '[a-z0-9_]'
>
> alphanumerics (including underscore)
>
> '+'
>
> one or more times, followed by
>
> '([\\-\\.][a-z0-9_]+)'
>
> site stuff [2]
>
> '+'
>
> one or more times.
>
>
> As for [1] email stuff, it can be (zero or more of):
>
> '[\\-\\.]'
>
> hyphen or dot, followed by
>
> '[a-z0-9_]'
>
> alphanumerics (or underscore)
>
> '+'
>
> one or more times.
>
>
> And as for [2] site stuff, it can be (one or more of):
>
> '[\\-\\.]'
>
> hyphen or dot, followed by
>
> '[a-z0-9_]'
>
> alphanumerics (or underscore)
>
> '+'
>
> one or more times.
>
> So the personal part of the address can just be a
> single word, with optional dot-or-hyphen joining other
> words (ZERO or more) onto it; this is followed by at-sign,
> which is followed by the site portion of the address,
> namely: any word, followed by ONE or more dot-or-hyphen-
> followed-by-another-word combo's.
>
> It's not a perfect match for all legal email patterns,
> (for example, bad(at)my-addr would match) but something like
> it might be sufficient depending on your requirements.
>
>
> CONCLUSION
> ==========
>
> Regular expressions are complicated, because they're
> powerful! (Or is it the other way around?)
>
> Check out "Pattern Matching" in the PostgreSQL manual
> for more details -- on my Debian "Potato" system it's in
> html/postgres/functions-matching.html under the
> /usr/share/doc/postgresql-doc/ directory.
>
> --
> mailto:will(at)serensoft(dot)com
> http://www.dontUthink.com/
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/users-lounge/docs/faq.html
>

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Bruce Momjian 2002-08-26 00:51:35 Re: Solved! MacOS X and external functions
Previous Message cnliou 2002-08-26 00:33:53 Re: Controling Rule's Firing Order