Steve Litt's Perls of Wisdom: Perl Regular Expressions (with snippets)
Copyright (c) 1998-2001 by Steve Litt

Introduction

Without regular expressions, Perl would be a fast development environment, probably a little faster than VB for console apps. With the addition of regular expressions, Perl exceeds other RAD environments five to twenty-fold in the hands of an experienced practitioner, on console apps whose problem domains include parsing (and that's a heck of a lot of them).

Regular expressions are a HUGE area of knowledge, bordering on an art. Rather than regurgitate the contents of the Perl documentation or the plethora of Perl books at your local bookstore, this page will attempt to give you the 10% of regular expressions you'll use 90% of the time. Note that for this reason we assume all strings to be single-line strings containing no newline chars.

What They Are

Regular expressions are a syntax, implemented in Perl and certain other environments, making it not only possible but easy to do some of the following:

Complex string comparisons
    $string =~ m/sought_text/;    # m before the first slash is the "match" operator.
Complex string selections
    $string =~ m/whatever(sought_text)whatever2/;
    $soughttext = $1;
Complex string replacements
    $string =~ s/originaltext/newtext/;    # s before the first slash is the "substitute" operator.
Parsing based on the above abilities

Doing String Comparisons

We start with string comparisons because they're the easiest, and yet most of what's contained here is applicable in selecting and replacing text.

Simple String Comparisons

The most basic string comparison is

    $string =~ m/sought_text/;

The above returns true if string $string contains substring "sought_text", false otherwise. If you want only those strings where the sought text appears at the very beginning, you could write the following:

    $string =~ m/^sought_text/;

Similarly, the $ operator indicates "end of string". If you wanted to find out whether the sought text was the very last text in the string, you could write this:

    $string =~ m/sought_text$/;

Now, if you want the comparison to be true only if $string contains the sought text and nothing but the sought text, simply do this:

    $string =~ m/^sought_text$/;

Now what if you want the comparison to be case insensitive? All you do is add the letter i after the ending delimiter:

    $string =~ m/^sought_text$/i;
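
To see the four comparisons side by side, here is a minimal sketch. The helper name classify and the test strings are made up for illustration; only the patterns come from the text above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: list which of the comparisons a string passes.
sub classify {
    my ($string) = @_;
    my @hits;
    push @hits, 'contains' if $string =~ m/sought_text/;
    push @hits, 'starts'   if $string =~ m/^sought_text/;
    push @hits, 'ends'     if $string =~ m/sought_text$/;
    push @hits, 'exact'    if $string =~ m/^sought_text$/i;   # case insensitive
    return join ',', @hits;
}

for my $s ('sought_text', 'SOUGHT_TEXT', 'xx sought_text', 'sought_text yy') {
    print "$s: ", classify($s), "\n";
}
```

Only the last comparison carries the i, so the all-uppercase string passes that one test and none of the case-sensitive ones.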

Using Simple "Wildcards" and "Repetitions"

Calling these "wildcards" may actually conflict with the theoretical grammar and syntax of Perl, but in fact it is the most intuitive way to think of them, and will not lead to any coding mistakes.

    .       Match any character
    \w      Match "word" character (alphanumeric plus "_")
    \W      Match non-word character
    \s      Match whitespace character
    \S      Match non-whitespace character
    \d      Match digit character
    \D      Match non-digit character
    \t      Match tab
    \n      Match newline
    \r      Match return
    \f      Match formfeed
    \a      Match alarm (bell, beep, etc)
    \e      Match escape
    \021    Match octal char (in this case 21 octal)
    \xf0    Match hex char (in this case f0 hexadecimal)

You can follow any character, wildcard, or series of characters and/or wildcards with a repetition. Here's where you start getting some power:

    *       Match 0 or more times
    +       Match 1 or more times
    ?       Match 1 or 0 times
    {n}     Match exactly n times
    {n,}    Match at least n times
    {n,m}   Match at least n but not more than m times

Now for some examples:

    $string =~ m/^\s*rem/i;           # true if the first printable text is rem or REM
    $string =~ m/^\S{1,8}\.\S{0,3}/;  # check for DOS 8.3 filename
                                      # (note a few illegals can sneak thru)
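
Here is a small sketch that exercises those two patterns. The sub names and sample file names are assumptions for illustration. Note how an over-long extension sneaks through, because the pattern is not anchored at the end.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# True if the first printable text on the line is "rem" in any case.
sub is_rem_line {
    my ($line) = @_;
    return $line =~ m/^\s*rem/i ? 1 : 0;
}

# Rough DOS 8.3 check: 1-8 non-space chars, a dot, then 0-3 more.
# The end is not anchored, so over-long extensions still pass.
sub looks_like_8_3 {
    my ($name) = @_;
    return $name =~ m/^\S{1,8}\.\S{0,3}/ ? 1 : 0;
}

for my $name ('autoexec.bat', 'readme', 'verylongname.txt', 'a.toolongext') {
    print "$name: ", (looks_like_8_3($name) ? 'match' : 'no match'), "\n";
}
```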

Using Groups ( ) in Matching

Note: Many situations can be handled either with groups ( ) or with character classes [ ]. Groups are less quirky and more often yield the results you expect.

Groups are regular expression characters surrounded by parentheses. They have two major uses:

To allow alternative phrases, as in /(Clinton|Bush|Reagan)/i. Note that for single character alternatives, you can also use character classes.
As a means of retrieving selected text in selection, translation and substitution, used with the $1, $2, etc. scalars.

This section will discuss only the first use. The second use is covered in the section on doing string selections.

Powerful regular expressions can be made with groups. At its simplest, you can match either all lowercase or name case like this:

    if ($string =~ m/(B|b)ill (C|c)linton/) {print "It is Clinton, all right!\n"}

Detect all strings containing vowels:

    if ($string =~ m/(A|E|I|O|U|Y|a|e|i|o|u|y)/) {print "String contains a vowel!\n"}

Detect if the line starts with any of the last three presidents:

    if ($string =~ m/^(Clinton|Bush|Reagan)/i) {print "$string\n"};

Note that the parenthesized element will appear as $1 in statements that follow the regular expression. That's OK. If you don't want to use $1, just ignore it. The use of $1, etc., will be explained in the section on doing string selections.
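
As a quick illustration of alternatives in a group, here is a minimal sketch. The sub name leading_president and the sample lines are hypothetical; the pattern is the one shown above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns the name as captured in $1, or undef if the line does not
# start with one of the three alternatives.
sub leading_president {
    my ($line) = @_;
    return $line =~ m/^(Clinton|Bush|Reagan)/i ? $1 : undef;
}

for my $line ('Clinton, Bill', 'bush, george', 'Carter, Jimmy') {
    my $who = leading_president($line);
    print "$line: ", (defined $who ? "starts with $who" : 'no match'), "\n";
}
```

Because of the i, the captured text keeps whatever case the input line used.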

Using Character Classes [ ]

Character classes are alternative single characters within square brackets, and are not to be confused with OOP classes, which are blueprints for objects. If not used carefully, they can yield unexpected results. Remember that groups are an alternative.
Character classes have three main advantages:

Shorthand notation, as [AEIOUY] instead of (A|E|I|O|U|Y). This advantage is minor at best.
Character ranges, such as [A-Z].
One to one mapping from one set of characters to another, as in tr/a-z/A-Z/. This is essential! It will be discussed in the section on translations. (Strictly speaking, tr takes bare character lists and ranges rather than bracketed classes, but the idea is the same.)

The whole thing in the square brackets represents exactly one character!!! Did I shout loud enough? It may be tempting to do something like this:

    if ($string =~ /[Clinton|Bush|Reagan]/) {$office = "President"}

The above may even appear to work upon casual testing. Don't do it. Remember that everything inside the brackets represents ONE character, simply listing all its alternative possibilities.
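
A short sketch makes the danger concrete. The sub names and the sample names are made up for illustration. The bracketed version matches any string containing even one of the listed letters, while the group matches only the whole names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Wrong: a character class, so any single listed letter is enough.
sub class_match {
    my ($name) = @_;
    return $name =~ /[Clinton|Bush|Reagan]/ ? 1 : 0;
}

# Right: a group of alternative phrases.
sub group_match {
    my ($name) = @_;
    return $name =~ /(Clinton|Bush|Reagan)/ ? 1 : 0;
}

for my $name ('Bush', 'Nixon', 'Ford') {
    print "$name: class=", class_match($name), " group=", group_match($name), "\n";
}
```

Nixon and Ford both slip past the class version, because each contains a letter such as i or o that appears somewhere inside the brackets.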

Other Quirks

I haven't fully investigated this yet, but character classes seem to sometimes do goofy things in regular expressions where the case is ignored (i after the trailing delimiter).

Special Characters Inside the Square Brackets

As we've already seen, a hyphen is used to indicate all characters in the collating sequence between the character on the hyphen's left and the character on its right.

An uparrow (^) immediately following the opening square bracket means "anything but these characters", and effectively negates the character class. For instance, to match anything that is not a vowel, do this:

    if ($string =~ /[^AEIOUYaeiouy]/) {print "This string contains a non-vowel"}

Contrast to this:

    if ($string !~ /[AEIOUYaeiouy]/) {print "This string contains no vowels at all"}
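
The difference between negating the class and negating the match is easy to blur, so here is a minimal sketch of both. The sub names and sample words are assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Negated class: true if at least one character is not a vowel.
sub has_non_vowel {
    my ($s) = @_;
    return $s =~ /[^AEIOUYaeiouy]/ ? 1 : 0;
}

# Negated match: true only if no character at all is a vowel.
sub has_no_vowels {
    my ($s) = @_;
    return $s !~ /[AEIOUYaeiouy]/ ? 1 : 0;
}

for my $word ('aeiou', 'bat', 'brrr') {
    print "$word: non-vowel=", has_non_vowel($word),
          " no-vowels=", has_no_vowels($word), "\n";
}
```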

Best Uses of Character Classes

Print all people whose name begins with A through E:

    if ($string =~ m/^[A-E]/) {print "$string\n"}

If character classes are giving you quirky results, consider using groups.

Matching: Putting it All Together

Print everyone whose last name is Clinton, Bush or Reagan. Each element of the list is first name, blank, last name, and possibly more blanks and more info after the last name. Study this til you understand it...

    if ($string =~ m/^\S+\s+(Clinton|Bush|Reagan)/i) {print "$string\n"};

Print every line with a valid phone number:

    if ($string =~ m/[\)\s\-]\d{3}-\d{4}[\s\.\,\?]/) {print "Phone line: $string\n"}
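
Here is a rough sketch that runs both kinds of test against a few made-up lines. The sub names and sample data are hypothetical, and the phone pattern is just one plausible reading of the example above. It insists on a character after the number, so a number at the very end of a line is missed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# First name, blanks, then one of the three last names (captured).
sub president_last_name {
    my ($line) = @_;
    return $line =~ m/^\S+\s+(Clinton|Bush|Reagan)/i ? $1 : undef;
}

# Seven-digit phone number with a character on each side of it.
sub has_phone {
    my ($line) = @_;
    return $line =~ m/[\)\s\-]\d{3}-\d{4}[\s\.\,\?]/ ? 1 : 0;
}

for my $line ('Bill Clinton   42nd', 'Clinton Bill') {
    my $who = president_last_name($line);
    print "$line: ", (defined $who ? $who : 'no match'), "\n";
}
for my $line ('Call (310) 555-1234 today', 'Call 555-1234') {
    print "$line: ", (has_phone($line) ? 'phone' : 'no phone'), "\n";
}
```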

Doing String Selections (Parsing)

If regular expressions' only benefit was looking for an (albeit complex) string within a string, they would not be worth learning. Regular expressions (and Perl itself, for that matter) really start earning their keep by selecting text out of a string, in other words by parsing.

For instance, create a program whose input is a piped-in directory command, whose output goes to stdout, and whose output represents a batch file which copies every file (not directory) older than 12/22/97 to a directory called \oldie. This would be pretty nasty in C or C++. The directory output would look something like this:

    Volume in drive D has no label
    Volume Serial Number is 4547-15E0
    Directory of D:\polo\marco

    my($totalBytes) = 0;
    while (<STDIN>)
        {
        my($line) = $_;
        chomp($line);
        if ($line !~ /
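
As a rough sketch of selection with groups, the following parses one directory line and emits a copy command for files older than 12/22/97. The line layout (name, extension, size, MM-DD-YY date), the <DIR> marker for directories, and the sub name copy_command are all assumptions for illustration, not the listing format or code from this page.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed input layout (hypothetical):
#   NAME     EXT      SIZE  MM-DD-YY  TIME
# Returns a copy command for files dated before 12-22-97, else undef.
sub copy_command {
    my ($line) = @_;
    return if $line =~ m/<DIR>/;    # skip directories
    # Select name, extension and the three date parts with groups.
    if ($line =~ m/^(\S+)\s+(\S+)\s+[\d,]+\s+(\d\d)-(\d\d)-(\d\d)/) {
        my ($name, $ext, $mm, $dd, $yy) = ($1, $2, $3, $4, $5);
        # Compare as a YYMMDD string; only valid within one century.
        return "copy $name.$ext \\oldie" if "$yy$mm$dd" lt '971222';
    }
    return;
}

for my $line ('MARCO    TXT      1,234  12-21-97   3:45p',
              'POLO     <DIR>           12-01-97   3:45p') {
    my $cmd = copy_command($line);
    print "$cmd\n" if defined $cmd;
}
```

The five parenthesized groups come back as $1 through $5, which is the whole trick of selection: match the line once and pick the pieces out by position.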

Doing Substitutions

Replace every "Bill Clinton" with an "Al Gore". The g after the closing delimiter means "every occurrence"; without it only the first one is replaced:

    $string =~ s/Bill Clinton/Al Gore/g;

Now do it ignoring the case of bill clinton:

    $string =~ s/Bill Clinton/Al Gore/gi;
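
Here is a minimal sketch contrasting a plain substitution with one that carries the g (every occurrence) and i (ignore case) modifiers. The sub names and the sample sentence are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Replaces only the first exact-case occurrence.
sub replace_first {
    my ($s) = @_;
    $s =~ s/Bill Clinton/Al Gore/;
    return $s;
}

# Replaces every occurrence, ignoring case.
sub replace_all {
    my ($s) = @_;
    $s =~ s/Bill Clinton/Al Gore/gi;
    return $s;
}

my $text = 'Bill Clinton and bill clinton';
print replace_first($text), "\n";
print replace_all($text), "\n";
```

Both subs work on a copy of the argument, so the caller's string is left alone.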

Doing Translations

Translations are like substitutions, except they happen on a letter by letter basis instead of substituting a single phrase for another single phrase. Note that tr takes plain lists and ranges of characters; square brackets and commas inside it are treated as ordinary characters, so leave them out. For instance, what if you wanted to make all vowels upper case:

    $string =~ tr/aeiouy/AEIOUY/;

Change everything to upper case:

    $string =~ tr/a-z/A-Z/;

Change everything to lower case:

    $string =~ tr/A-Z/a-z/;

Change all vowels to numbers to avoid "4 letter words" in a serial number:

    $string =~ tr/AEIOUYaeiouy/123456123456/;
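
A short sketch of tr in action follows. The sub names are hypothetical. The last sub shows that tr treats square brackets as ordinary characters to be mapped, not as class delimiters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# tr maps characters position by position: a->A, e->E, and so on.
sub upcase_vowels {
    my ($s) = @_;
    $s =~ tr/aeiouy/AEIOUY/;
    return $s;
}

# With an empty replacement list, tr changes nothing and returns a count.
sub count_vowels {
    my ($s) = @_;
    return ($s =~ tr/aeiouyAEIOUY//);
}

# Brackets are ordinary characters to tr: [ maps to ( and ] maps to ).
sub bracket_demo {
    my ($s) = @_;
    $s =~ tr/[a-z]/(A-Z)/;
    return $s;
}

print upcase_vowels('mississippi'), "\n";
print count_vowels('mississippi'), "\n";
print bracket_demo('[ab]'), "\n";
```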

Greedy and Ungreedy Matching

Perl regular expressions normally match the longest string possible. For instance:

    my($text) = "mississippi";
    $text =~ m/(i.*s)/;
    print $1 . "\n";

Run the preceding code, and here's what you get:

    ississ

It matches the first i, the last s, and everything in between. But what if you want to match the first i to the s most closely following it? Use this code:

    my($text) = "mississippi";
    $text =~ m/(i.*?s)/;
    print $1 . "\n";

Now look what the code produces:

    is

Clearly, the use of the question mark makes the match ungreedy. But there's another problem, in that regular expressions always try to match as early as possible. Read on...
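
The sketch below puts the greedy and ungreedy patterns next to a third one that shows the "as early as possible" behavior. The helper name first_capture and the third pattern are assumptions added for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return what the first group captured, or undef.
sub first_capture {
    my ($text, $regex) = @_;
    return $text =~ $regex ? $1 : undef;
}

print first_capture('mississippi', qr/(i.*s)/),  "\n";   # greedy
print first_capture('mississippi', qr/(i.*?s)/), "\n";   # ungreedy
print first_capture('mississippi', qr/(s.*?p)/), "\n";   # leftmost start wins
```

The third pattern is ungreedy, yet it captures ssissip rather than the shorter sip, because the match starts at the first s it finds and only then stretches as little as it can.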

Resolving Doubledots in a Filepath

Doubledots are placefillers for "go up one directory" in a file path. Typically, when you desire to create an absolute path, you want to resolve them by deleting them and the level of directory above them. For instance,

    /a/b/../whatever

becomes

    /a/whatever

This is MUCH trickier than it might seem. It's likely that all your ideas about greedy matching, replacement strings and the like will not work. Here's the regular expression to resolve A SINGLE doubledot:

    $text =~ s/\/[^\/]*\/\.\.//;

The preceding resolves doubledots in a string as long as the string has only one doubledot. But the plot thickens...

Doubledots can occur alternating with directories (/a/b/../c/../d) or nested (/a/b/c/../../d). The best way I've found to reliably resolve all doubledots is to make a function that loops through the preceding regular expression until there are no more doubledots. Here's the function:

    sub deleteDoubleDots($)
        {
        while ($_[0] =~ m/\.\./)
            {
            $_[0] =~ s/\/[^\/]*\/\.\.//;
            }
        }
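
Here is one possible variation on that loop, offered as a sketch rather than as the function above. It returns a new string instead of changing its argument, and it stops as soon as a substitution fails, so a path with a leading doubledot that cannot be resolved does not loop forever. The sub name and the lookahead guards are assumptions of this sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Remove "/dir/.." pairs one at a time until none can be removed.
# The (?!...) guard keeps ".." itself from being taken as the directory.
sub delete_double_dots {
    my ($path) = @_;
    1 while $path =~ s{/(?!\.\.(?:/|\z))[^/]+/\.\.(?=/|\z)}{};
    return $path;
}

for my $path ('/a/b/../whatever', '/a/b/../c/../d', '/a/b/c/../../d') {
    print "$path => ", delete_double_dots($path), "\n";
}
```

Using braces as delimiters saves escaping every slash in the pattern.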

Kewl Splitpath One Liner Regex

Check out this splitpath command:

    my($text) = "/etc/sysconfig/network-scripts/ifcfg-eth0";
    my($directory, $filename) = $text =~ m/(.*\/)(.*)$/;
    print "D=$directory, F=$filename\n";

Is that cool or what?
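
The same one-liner can be wrapped in a small sub, sketched below with a hypothetical name. It works because the first group is greedy and grabs everything up to the LAST slash, leaving the rest for the second group.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns (directory, filename); both are undef if the path has no slash.
sub split_path {
    my ($path) = @_;
    my ($directory, $filename) = $path =~ m{^(.*/)(.*)$};
    return ($directory, $filename);
}

my ($dir, $file) = split_path('/etc/sysconfig/network-scripts/ifcfg-eth0');
print "D=$dir, F=$file\n";
```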

Using a Variable as a Match Expression

You can use a variable inside the match expression. This yields tremendous power. Simply place the variable name between the forward slashes, and the expression will be sought in the string. Here's an example:

    #!/usr/bin/perl -w
    use strict;

    sub test($$)
        {
        my $lookfor = shift;
        my $string = shift;
        print "\n$lookfor ";
        if ($string =~ m/($lookfor)/)
            {
            print "is in ";
            }
        else
            {
            print "is not in ";
            }
        print "$string.";
        if (defined($1))
            {
            print " <$1>";
            }
        print "\n";
        }

    test("st.v.", "steve was here");
    test("st.v.", "kitchen stove");
    test("st.v.", "kitchen store");

The preceding code produces the following output:

    [slitt@mydesk slitt]$ ./junk.pl

    st.v. is in steve was here. <steve>

    st.v. is in kitchen stove. <stove>

    st.v. is not in kitchen store.

As you can see, you can seek a regular expression stored in a variable, and you can retrieve the result in $1.

Symbol Explanations

=~
This operator appears between the string var you are comparing and the regular expression you're looking for (note that in selection or substitution a regular expression operates on the string var rather than comparing). Here's a simple example:

    $string =~ m/Bill Clinton/;          # return true if var $string contains the name of the president
    $string =~ s/Bill Clinton/Al Gore/;  # replace the president with the vice president

!~
Just like =~, except negated. With matching, returns true if it DOESN'T match. I can't imagine what it would do in translates, etc.

/
This is the usual delimiter for the text part of a regular expression. If the sought-after text contains slashes, it's sometimes easier to use pipe symbols (|) for delimiters, but this is rare. Here are simple examples:

    $string =~ m/Bill Clinton/;          # return true if var $string contains the name of the president
    $string =~ s/Bill Clinton/Al Gore/;  # replace the president with the vice president

m
The match operator. Coming before the opening delimiter, it means read the string expression on the left of the =~, and see if any part of it matches the expression within the delimiters following the m. Note that if the delimiters are slashes (which is the normal state of affairs), the m is optional and often not included. Whether it's there or not, it's still a match operation. Here are some examples:

    $string =~ m/Bill Clinton/;          # return true if var $string contains the name of the president
    $string =~ /Bill Clinton/;           # same result as the previous statement

^
This is the "beginning of line" symbol. When used immediately after the starting delimiter, it signifies "at the beginning of the line". For instance:

    $string =~ m/^Bill Clinton/;         # true only when "Bill Clinton" is the first text in the string

$
This is the "end of line" symbol. When used immediately before the ending delimiter, it signifies "at the end of the line". For instance:

    $string =~ m/Bill Clinton$/;         # true only when "Bill Clinton" is the last text in the string

i
This is the "case insensitivity" operator when used immediately after the closing delimiter. For instance:

    $string =~ m/Bill Clinton/i;         # true when $string contains "Bill Clinton" or "bill clinton"