Regular Expressions in Ten Steps

Posted on Mar 21st, 2014
Categories: programming, lists

Knowing how to use regular expressions is a lot like knowing how to cook: even though it is an incredibly useful skill, many people avoid learning it. I’ve been teaching people about regular expressions for years now, and since my voice is starting to give out, I thought that I would write it all down.

What Are Regular Expressions?

Regular expressions are simply strings that we use to match other strings. If you give me a regular expression, perhaps ‘Ols[eo]n’, and a string, perhaps ‘Olsen’ or ‘Olson’, I can tell you if the string matches the regular expression (they both do). Most computer users deal with this kind of pattern matching all the time: Windows users frequently look for Word documents by specifying ‘*.doc’. While the ‘*.doc’ type patterns are great for what they do, when you are doing complex pattern matching, you need a serious tool. Regular expressions are that tool.

Like all good tools, regular expressions make it easy to do simple things. For example, it is very easy to write a regular expression to:

  • Match any string that ends in .doc
  • Match any string that is all uppercase
  • Or all lowercase
  • Or Russ followed by anything followed by Olsen

You can also write regular expressions that match some pretty wild things. It is harder, but not really all that hard, to write a regular expression that matches:

  • A string starting with a letter or ‘_’ or ‘$’ followed by any number of letters, numbers, underscores or dollar signs
  • An alphanumeric string, followed by an @, followed by one or more alphanumeric strings separated by ‘.’, ending with the string ‘.com’
  • A series of words separated by white space, each word consisting of a vowel followed by the string way or ending with a consonant followed by the string ay

The good news is that there are just ten steps between you and these regular expression things. My hope is that by the end of the ten steps you will be able to write the simple regular expressions easily, and be able to puzzle out the more complex ones.

Step 1: If It Looks Like a Duck…

The good news is that with regular expressions, ordinary numbers and letters just match themselves:

  • The regular expression aaa matches only aaa
  • The regular expression 1234 matches only 1234
  • Usually, the match is case sensitive, so the pattern Russ would not match russ

While letters and numbers match themselves, most of the special characters (the ones like + and *) have special meanings in regular expressions.
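If you want to try this yourself, here is a minimal sketch. It assumes Python's re module as the regular expression engine, which is my choice for illustration rather than anything this article depends on, and `matches` is a hypothetical helper that asks whether the whole string matches the pattern.

```python
import re

def matches(pattern, text, flags=0):
    """Hypothetical helper: True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text, flags) is not None

print(matches("aaa", "aaa"))                   # True
print(matches("1234", "1234"))                 # True
print(matches("Russ", "russ"))                 # False: case sensitive by default
print(matches("Russ", "russ", re.IGNORECASE))  # True once case is ignored
```

The last line shows why the word "usually" matters: most engines, Python included, let you switch case sensitivity off.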

Step 2: There is No Question About the Dot

OK, if letters and numbers match themselves, what do the other characters match? Let’s start with the dot, the period, the full stop, that thing at the end of this sentence. The dot matches any single character. So,

  • A single period, ., matches a and b and c, among other things.
  • Two periods, .., match aa and bb but not a, or b or c
  • Rus. matches Russ and Rust, but not Rusty
  • Foo. matches Foot but not Foo

If you are familiar with the *.doc pattern matching that you do when you are trying to find a file, you may have been expecting to use the question mark ? for this. Get over it. The question mark does indeed have a special meaning in regular expressions, but it ain’t what you think. Remember, in regular expressions, the dot matches any single character. Well, almost any character: a dot will not match a line-terminating character like a newline or a carriage return, but for now just think of the dot as matching anything.
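The dot examples above can be checked with a short sketch. Again this assumes Python's re module, and `matches` is a hypothetical whole-string helper, not part of any library.

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches(".", "a"))         # True: one dot, one character
print(matches("..", "a"))        # False: two dots need two characters
print(matches("Rus.", "Rust"))   # True
print(matches("Rus.", "Rusty"))  # False: one character too many
print(matches("Foo.", "Foo"))    # False: the dot must match something
print(matches(".", "\n"))        # False: the dot skips the newline
```

Engines differ a little on exactly which characters count as line terminators; in Python, the dot excludes only the newline by default.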

Step 3: Quiet on the Set!

The dot is great if I want to match any single character, but what if I want to match something less than any? For example, how would I write an expression that would match only the vowels, or only the numbers? Enter the set. Sets in regular expressions let you match any one of a given set of characters. What you do is just wrap the characters in square brackets: the regular expression [aeiou] will match any one of the vowels while the set [0123456789] will match any single digit. So,

  • [abc] will match a and b and c but not d
  • [xyz] will match x or y or z but not xyz since [xyz] only matches one character.
  • [bcdfghjklmnpqrstvwxyz] will match any single lowercase consonant
  • [Rr]uss [Oo]lsen will match my name with or without capitals.
  • [1234567890][1234567890] will match any two digit number
  • [abcdefghijklmnopqrstuvwxyz] will match any single lowercase letter.
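Here is the same list as a runnable sketch, assuming Python's re module; `matches` is a hypothetical helper that tests the whole string against the pattern.

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches("[abc]", "b"))                        # True
print(matches("[abc]", "d"))                        # False: d is not in the set
print(matches("[xyz]", "xyz"))                      # False: a set matches ONE character
print(matches("[Rr]uss [Oo]lsen", "russ olsen"))    # True
print(matches("[Rr]uss [Oo]lsen", "Russ Olsen"))    # True
print(matches("[1234567890][1234567890]", "42"))    # True: two sets, two digits
```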

Step 4: Home on the Range

Sets are great, but let’s be honest: no one wants to type [0123456789], let alone [abcdefghijklmnopqrstuvwxyz]. Fortunately, we have a nice shorthand for just these situations. We can use a range: [0-9] does exactly what you expect, and so does [a-z]:

  • [a-z] is the same as [abcdefghijklmnopqrstuvwxyz]
  • [A-Z] is the same as [ABCDEF…XYZ]
  • [a-c] is the same as [abc]

You can also combine ranges:

  • [a-zA-Z] will match any single upper or lower case letter.
  • [a-z0-9] will match any single lowercase letter or number
  • [a-z0-9A-Z] will match any single letter or number

You can also combine ranges with ordinary sets:

  • [a-z123] will match any lowercase letter or 1 or 2 or 3
  • [a-zA-Z$_] will match any letter or _ or $
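A quick sketch to convince yourself that a range really is just shorthand, assuming Python's re module. The `matches` helper and the variable names are hypothetical, chosen only for this illustration.

```python
import re
import string

def matches(pattern, text):
    """Hypothetical helper: True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# Compare the long set and the range on every ASCII letter and digit.
long_form = "[abcdefghijklmnopqrstuvwxyz]"
short_form = "[a-z]"
same = all(
    matches(long_form, ch) == matches(short_form, ch)
    for ch in string.ascii_letters + string.digits
)
print(same)                        # True: the two sets agree everywhere

print(matches("[a-z0-9]", "7"))    # True
print(matches("[a-z0-9]", "Q"))    # False: no uppercase in this set
print(matches("[a-z123]", "4"))    # False: only 1, 2 and 3 are listed
print(matches("[a-zA-Z$_]", "$"))  # True: $ is an ordinary member of the set
```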

Step 5: The Star of the Show

There are just a couple of humps you need to get over to really understand regular expressions, and this is the first one: The asterisk * means to match zero or more of the thing that came just before it. Pause and think that through for a minute. Zero or more of the thing that came just before it.

What I mean is that A* will match zero or more As. The A is the thing that came before the star, and so the pattern will match zero or more As. The pattern AB* matches an A followed by any number of Bs (B is the thing that came just before the star), so AB* will match all of the following:

  • A
  • AB
  • ABB
  • ABBB

See a pattern?

The star doesn’t have to be at the end of the regular expression either. It can be near the start: R*uss means “any number of Rs followed by uss”. So uss, Russ, RRuss, RRRRRRRuss all match.

The star can be in the middle of the regular expression as well: Ya*Hoo will match YHoo as well as YaHoo and YaaaaaaHoo.

You can also use the star in combination with sets. The expression [aeiou]* will match any number of vowels: the whole [aeiou] pattern is the thing that came before the star. Likewise the expression [0-9]* will match any number of digits.

Finally, you are not limited to just one star in a regular expression. You can have any number you like: Ya*Ho* will match YH (remember that ZERO or more thing) as well as YaaaaaaHooo. The regular expression Wo*Bo*y* will match WoB and WooooooBoooooyyyyyyy and many things in between.
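The star is worth experimenting with. Here is a minimal sketch, assuming Python's re module and a hypothetical whole-string helper named `matches`:

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches("AB*", "A"))          # True: zero Bs is fine
print(matches("AB*", "ABBB"))       # True
print(matches("R*uss", "uss"))      # True: zero Rs
print(matches("R*uss", "RRuss"))    # True
print(matches("Ya*Hoo", "YHoo"))    # True: zero a's in the middle
print(matches("[aeiou]*", "aeiou")) # True: the star applies to the whole set
print(matches("[aeiou]*", ""))      # True: zero vowels is still a match
print(matches("Ya*Ho*", "YH"))      # True: both stars match nothing
print(matches("Wo*Bo*y*", "WoB"))   # True
```

Notice that a starred pattern happily matches the empty string, which is the "zero or more" rule taken to its limit.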

A Pause for Breath

Let’s stop for a minute and review. So far we have learned that regular expressions are just patterns that you can match against a string. We’ve also discovered that:

  • Letters and numbers just match themselves: 49 ducks matches only 49 ducks and nothing else.
  • The dot, ., matches any one character (other than those invisible things you find at the end of a line)
  • Square brackets match any one of the characters inside the brackets, so [abc] will match either a, or b or c but not x (wrong character), and not abc (that’s three characters, not one)
  • We can abbreviate long sets like [abcdefghijklmnopqrstuvwxyz] to [a-z]
  • The star, *, matches zero or more of the thing that came right before, so A* matches A, and AA, and AAAAAAA, and lots of other AAAA-like strings

OK, let’s move on to steps six through ten…

Step 6: The Eternal Question: What Is .*?

If you grasp what is going on in this step you will have pretty much beaten regular expressions, so read this one carefully. Let’s start with a question: if a dot matches any single character, and a star matches any number of the thing that came before, what does .* match?

Take it a step at a time: the dot matches any single character. The star matches any number of the thing that came before. So .* will match any number of any character. In other words, .* will match anything!

Let me say it again: since the star matches any number of what came before, then .* is the same as . and .. and ... and so on. But . matches any single character, while .. matches any two characters, and so on. So .* will match anything.

Let’s look at some examples:

  • The regular expression A.* will match an A followed by any number of anything, or, in plain English, an A followed by anything. So A.* will match ABC, but also AQDFXJG, and A1234
  • The regular expression .*Z will match anything followed by the letter Z. So .*Z will match XYZ and RussellZ and F-F-fZ, but it will not match ZZZA since it doesn’t end with a Z.
  • The regular expression A.*Z will match an A followed by anything, followed by the letter Z. So A.*Z will match ABCXYZ and AZ and AF-F-fZ, but it will not match ZZZ since it doesn’t start with an A.

Keep in mind that since the dot will not match an end of line character, .* actually only matches anything on one single line.
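Since this is the step that matters most, here is a small sketch you can poke at. It assumes Python's re module, and `matches` is a hypothetical helper that checks the whole string.

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches("A.*", "ABC"))         # True: A followed by anything
print(matches("A.*", "A"))           # True: "anything" includes nothing
print(matches(".*Z", "RussellZ"))    # True: anything, then a Z
print(matches(".*Z", "ZZZA"))        # False: does not end with Z
print(matches("A.*Z", "AZ"))         # True: nothing at all between A and Z
print(matches("A.*Z", "ABCXYZ"))     # True
print(matches("A.*Z", "ZZZ"))        # False: does not start with A
print(matches(".*", "two\nlines"))   # False: the dot stops at the newline
```

The last line is the one-line limit in action. Python has a flag, re.DOTALL, that lets the dot cross newlines, but that is an option of this particular engine and not something the ten steps rely on.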

As I say, spend the time to understand the .* thing and you will really be on your way with regular expressions.

Step 7: Friends of the Stars

OK, if we have the star beaten, it is pretty much downhill from here. The next regular expression special character we want to take on is the plus sign, +. The plus sign is very similar to the star: while the star matches ZERO or more of the thing that came before, the plus sign matches ONE or more of the thing that came before. So while A*B will match plain old B, the smallest thing that A+B will match is AB.

To repeat, + is the same idea as the star, but you gotta have at least one of those previous things:

  • The expression XYZ+ will match XYZ and XYZZ, but not plain old XY
  • The expression 10+1 will match 101, and 1001, but not 11

The question mark is another variant on the same theme: the question mark matches ZERO or ONE of the previous thing:

  • Russ? will match Rus and Russ and nothing else
  • A?B will match only B and AB and nothing else
  • XY?Z will match XYZ and XZ and nothing else
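The three repeaters side by side, as a sketch that assumes Python's re module and a hypothetical whole-string helper called `matches`:

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# Star versus plus: zero or more versus one or more.
print(matches("A*B", "B"))       # True: zero As is allowed
print(matches("A+B", "B"))       # False: plus needs at least one A
print(matches("A+B", "AB"))      # True

print(matches("XYZ+", "XYZZ"))   # True
print(matches("XYZ+", "XY"))     # False: at least one Z required
print(matches("10+1", "1001"))   # True
print(matches("10+1", "11"))     # False: no zero in the middle

# Question mark: zero or one.
print(matches("Russ?", "Rus"))   # True
print(matches("Russ?", "Russs")) # False: only one optional s
print(matches("XY?Z", "XZ"))     # True
```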

Step 8: In the Beginning there is a Hat

Regular expressions are frequently used by the find or search commands of text editors. Editors typically do not expect you to enter a regular expression for the whole line; instead they just scan the line looking for any little chunks of text that will match your regular expression. So if my file contains:

Billy The Kid
Bill Russell
Russell Olsen

And I search for Russell, my editor will find the second and third lines. But what if I only want lines that start with Russell? Enter the caret character ^. The caret (or circumflex, or hat, or whatever you call it) character matches the invisible bit of margin at the beginning of a line.

So, for example:

  • ^foo will match any line which begins with foo
  • ^A will match any line which begins with an A

The caret also has an evil twin: The dollar sign $ matches the end of a line:

  • foo$ will match any line ending with foo
  • ^A$ will match any line which consists of exactly an A
  • ^Russ?$ will match any line that consists of Rus or Russ
  • ^$ will match any empty line
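To imitate an editor's search, the sketch below scans each line for a chunk that matches, which is what Python's re.search does. The three-line file comes from the example above; the function name `find_lines` is hypothetical, and treating the file as a list of lines is a simplification.

```python
import re

lines = ["Billy The Kid", "Bill Russell", "Russell Olsen"]

def find_lines(pattern, lines):
    """Hypothetical helper: the lines containing a match anywhere in them."""
    return [line for line in lines if re.search(pattern, line)]

print(find_lines("Russell", lines))    # ['Bill Russell', 'Russell Olsen']
print(find_lines("^Russell", lines))   # ['Russell Olsen']
print(find_lines("Russell$", lines))   # ['Bill Russell']
print(find_lines("^Russ?$", ["Rus", "Russ", "Russell"]))  # ['Rus', 'Russ']
print(find_lines("^$", ["", "not empty"]))                # ['']
```

Searching one line at a time keeps the caret and dollar sign simple. If you hand Python a single string containing several lines, you would need the re.MULTILINE flag for ^ and $ to work at every line rather than only at the ends of the string.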

Step 9: Reversing the Polarity of the Neutron Flow

It’s great that all of these special characters have all of this power, but what happens if you just want to match a plain old ^ or . or *? Guess what: you need another special character! And that character is the backslash \.

The backslash removes the magic from any special character that follows the backslash:

  • The regular expression A\*B matches only A*B
  • The regular expression \+\+ will match two plus signs ++
  • The regular expression \.* will match any number of actual dots

Finally, and you knew this one was coming, two backslashes \\ match a single backslash \.
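One practical wrinkle when you try this in code: many programming languages also treat the backslash as special inside ordinary string literals. The sketch below assumes Python, where a raw string (the r"..." form) passes backslashes through to the regular expression untouched. The `matches` helper is hypothetical.

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches(r"A\*B", "A*B"))    # True: the star is now just a star
print(matches(r"A\*B", "AAB"))    # False: no repetition going on here
print(matches(r"\+\+", "++"))     # True: two literal plus signs
print(matches(r"\.*", "..."))     # True: any number of actual dots
print(matches(r"\.*", "abc"))     # False: the dot no longer matches anything else
print(matches(r"\\", "\\"))       # True: a single literal backslash

# re.escape adds the backslashes for you when the text to match is not fixed.
print(matches(re.escape("A*B"), "A*B"))  # True
```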

Step 10: It’s an Either Or Situation

Sometimes the pattern we are looking for is either this or that: perhaps we need to find a line which either starts with Russ or ends with Olsen. For this, we need the pipe | character. A pipe character separates two expressions and will match either one or the other:

  • The pattern larry|moe will match either larry or moe
  • The pattern .*xml|.*html will match many common file names
  • The pattern a|b is pretty much the same as ‘[ab]’ - remember step 3?
  • The pattern ^Russ|Olsen$ will match a line that either starts with Russ or ends with Olsen

You are not limited to only two patterns with the pipe:

  • The pattern larry|moe|curley will match any one of the three great actors
  • The pattern .*doc|.*xls|.*html|.*xml will match even more file names
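A last sketch for the pipe, assuming Python's re module. Both helper names are hypothetical: `matches` tests the whole string, while `found_in` scans for a matching chunk the way an editor's search would, and the sample names and file names are made up for the example.

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

def found_in(pattern, line):
    """Hypothetical helper: True when pattern matches somewhere in line."""
    return re.search(pattern, line) is not None

print(matches("larry|moe", "moe"))            # True
print(matches("larry|moe|curley", "curley"))  # True
print(matches("larry|moe", "shemp"))          # False
print(matches("a|b", "b"))                    # True, just like [ab]

print(found_in("^Russ|Olsen$", "Russell Crowe"))  # True: starts with Russ
print(found_in("^Russ|Olsen$", "Ken Olsen"))      # True: ends with Olsen
print(found_in("^Russ|Olsen$", "Bill Russell"))   # False: neither one

print(matches(".*doc|.*xls|.*html|.*xml", "report.xls"))  # True
print(matches(".*doc|.*xls|.*html|.*xml", "photo.png"))   # False
```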

Wrapping Up

Well, there it is: the basics of regular expressions in ten not too horrible steps. The ten steps are only a start: you could write a web site or a whole book about regular expressions. But I hope that the ten steps will be like learning to boil an egg: not too difficult, but enough to get you cooking.
