CSE 374, Lecture 6: Regular Expressions + grep

Searching

Searching for things is a fundamental building block in using computers. We search for things on the Internet; we search for words in a paper we wrote to make sure we don't repeat ourselves; we search for files that we've seen before but forgot where they are; and many other things. In these cases, searching fundamentally comes down to matching an input string (the query) against some other strings that exist in the world (such as the Internet, the words in your thesis, file names, etc). We've seen three different ways to do this matching:

Exact matching. This is probably the most intuitive and straightforward way to match text - if the word is exactly the same. If you search in a webpage in Chrome or do Ctrl-s and type a word in emacs, you will be doing an exact match.
Globbing. "Globbing" is the use of a wildcard character to expand one string into a set of possible matches. We've seen this in shell filename metacharacters - if you do "ls *.txt", you are using globbing in order to match the "*" wildcard against any file in the current directory whose name ends with ".txt".
Regular expressions. "Regular expressions", or "regexes" for short, have some similarity with globbing but go way, way beyond it. Regexes are a sophisticated form of pattern matching, and they are commonly used in programming and computer science in order to do matching. Different users and programs have different flavors of regular expressions:
- theoretical computer science: regexes are a formal grammar for describing a set of strings
- grep: a basic program using the grammar to do pattern matching
- grep -E (egrep, or extended regular expressions): extends the grammar to do more powerful things.
- specific programming languages: each language may have its own distinct dialect of regular expressions (perl, python, etc)
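
To make the difference between globbing and regexes concrete, here is a small sketch that selects the same files both ways. The scratch directory and the file names are made up for illustration.

```shell
# Work in a scratch directory so the example is self-contained.
dir=$(mktemp -d)
cd "$dir" || exit 1
touch notes.txt todo.txt txt_reader.c

# Globbing: the shell expands *.txt into file names before ls even runs.
glob_matches=$(ls *.txt)

# Regex: grep filters the listing line by line.
# "\.txt$" means a literal ".txt" at the end of the line.
regex_matches=$(ls | grep '\.txt$')

echo "glob:  $glob_matches"
echo "regex: $regex_matches"
```

Both approaches pick notes.txt and todo.txt here, but the pattern languages differ: "*" alone means "anything" in a glob, while in a regex it only modifies whatever comes before it.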

Regular expressions are useful for a large variety of different applications. For example, how would you validate that a string is actually a phone number or an email address? If you were Google and were "crawling" the Internet, how would you extract URLs from a webpage? And many others, which we'll discuss today and tomorrow.

We'll be using the program/command "grep" in order to learn about regular expressions.

A bit of theory

In theoretical computer science, regular expressions are a formal grammar. We can express this grammar in terms of constants (constant sets of strings) and operators.

There are three constants:

∅ (empty set) - the set containing no strings
ε (empty string) - the string with no characters
a single literal character, such as "a"

And there are three operators to combine the constants:

Concatenation. Given a set of strings "R" and a set of strings "S", we can use "RS" to express the concatenation of any string in R with any string in S. For example, if R={"ab", "c"} and S={"d", "ef"}, then RS={"abd", "abef", "cd", "cef"}
Alternation. Given a set of strings "R" and a set of strings "S", we can express a pattern that consists of any string in either R or S (the "union" of the sets) by saying "R|S". For example, if R={"ab", "c"} and S={"d", "ef"}, then R|S={"ab", "c", "d", "ef"}
"Kleene star" (introduced by Stephen Kleene). Given a set of strings "R", "R*" is the set that contains the empty string plus every string formed by concatenating any number of strings from R. Think of this as the "zero or more" operator, reminiscent of the "*" we used in file metacharacter expansion. For example, if R={"ab", "c"}, then R*={(empty string), "ab", "c", "abab", "abc", "cab", "cc", "ababab", "abcab", ...}

Just as in mathematics, we can use parentheses to disambiguate between operators - (ab)*c is different from ab*c. Without parentheses, the order of operations, from highest precedence to lowest, is:

Kleene star
Concatenation
Alternation
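
The three operators and their precedence can be tried out directly with grep. This is only a sketch: it assumes a grep that supports -E (extended syntax, so "|", "()" and "*" need no backslashes) and -x (the whole line must match). The helper name match_whole and the candidate strings are made up for illustration.

```shell
# Hypothetical helper: print the candidate strings that a pattern matches in full.
match_whole() {
    printf '%s\n' abd abef cd cef ab c d ef abcab cc ac abc abbc ababc |
        grep -E -x "$1"
}

concat=$(match_whole '(ab|c)(d|ef)')   # concatenation RS: abd abef cd cef
altern=$(match_whole 'ab|c|d|ef')      # alternation R|S: ab c d ef
star=$(match_whole '(ab|c)*')          # Kleene star R*: ab c abcab cc abc ababc

# Precedence: star binds tighter than concatenation,
# so ab*c means a(b*)c unless parentheses say otherwise.
tight=$(match_whole 'ab*c')            # ac abc abbc
grouped=$(match_whole '(ab)*c')        # c abc ababc

printf '%s\n' "$concat" "$altern" "$star" "$tight" "$grouped"
```

Note that "ababc" only shows up when the star applies to the whole group "ab", and "abbc" only when it applies to the single "b".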

We'll be learning about a few other operators in this lecture, but they are no more expressive than the three core operators, and you can derive them from the three operators.
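
As a rough illustration of that claim, the sketch below compares three convenience operators with equivalent patterns built only from concatenation, alternation and star. It assumes grep -E; the word list is made up.

```shell
words=$(printf '%s\n' b ab aab aaab color colour colouur)

# "one or more" is one copy followed by "zero or more".
plus=$(printf '%s\n' "$words" | grep -E -x 'a+b')
plus_core=$(printf '%s\n' "$words" | grep -E -x 'aa*b')

# "zero or one" is an alternation between the two possibilities.
opt=$(printf '%s\n' "$words" | grep -E -x 'colou?r')
opt_core=$(printf '%s\n' "$words" | grep -E -x 'color|colour')

# A bounded repeat is an alternation of the spelled-out repeats.
rep=$(printf '%s\n' "$words" | grep -E -x 'a{2,3}b')
rep_core=$(printf '%s\n' "$words" | grep -E -x 'aab|aaab')

[ "$plus" = "$plus_core" ] && [ "$opt" = "$opt_core" ] &&
    [ "$rep" = "$rep_core" ] && echo "each pair matches the same words"
```

The short forms are easier to read and write, which is the only reason they exist.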

Exact match

While regular expressions can be very complicated, the simplest regular expression is just an exact match. If we have a dictionary of English words, then we can look through it for an exact match with a simple grep command.

    $ grep queueing words.txt

We're actually going to mostly use single quotes around the pattern when we use grep. Why is this? Well, we want to be able to include spaces in the pattern, and single quotes stop the shell from interpreting any of the characters inside them. If we used double quotes, the dollar sign ($), which we will introduce shortly, could instead be treated by the shell as the start of a shell variable.

    $ grep 'queueing' words.txt
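
To see what the shell does to a pattern before grep receives it, we can print the same text quoted both ways. The variable x below is hypothetical and is set only to make the substitution visible.

```shell
# Hypothetical variable, set only so the difference shows up.
x=SURPRISE

single='a$x'   # single quotes: the shell leaves $x alone
double="a$x"   # double quotes: the shell substitutes the value of x

printf '%s\n' "$single"   # grep would receive: a$x
printf '%s\n' "$double"   # grep would receive: aSURPRISE
```

With single quotes, what you type is what grep gets, which is one less thing to think about.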

Special characters

We'll learn special characters by going through a number of exercises (with the summary following). Each of the following expressions can be used in grep as "grep PATTERN words.txt" to find some set of matching words. Note that sometimes we'll have to escape the operators with a backslash ("\|" for example) but this can be different across different implementations of regular expressions. The rules for escaping are weird, and the general practice with regular expressions is to try a bunch of things until it works.

    # Search for words that start with "qu" and end with "ing"
    # and have 3 letters in the middle.
    'qu...ing'

    # Whoops! That doesn't quite work - it matches words that
    # have extra stuff before or after. Use "^" to specify the
    # beginning of a line and $ to specify the end of a line.
    '^qu...ing$'

    # Use .* to match zero or more characters of any kind.
    # Matches quoting, queueing, quivering, etc.
    '^qu.*ing$'

    # We can use multiple .* in the same pattern. This one
    # matches "luck" anywhere in the word (and is actually the
    # same as the plain pattern "luck"). Note that we can see
    # here that grep/regular expressions are GREEDY - they will
    # expand the ".*" to match as many characters as possible.
    '.*luck.*'

    # What if we actually mean the period character and not a
    # wildcard? We can "escape" the period with a backslash.
    '\.'

    # Let's look for all words that start with a (either capitalized
    # or lower case). We might try this, although it doesn't work:
    # the start-of-line anchor doesn't apply to the capital A, so
    # this also matches any word that contains an "A" anywhere.
    '^a\|A'

    # If we use parentheses (which need to be escaped), we can
    # express that the a|A operation takes priority.
    '^\(a\|A\)'

    # We can also use square brackets. Square brackets indicate that
    # the pattern can match any one character within the brackets,
    # in this case either a or A.
    '^[aA]'

    # Within square brackets, you can either list individual
    # characters like in the previous example or you can provide a
    # range of characters, such as in this pattern for any word
    # that begins with a capital letter.
    '^[A-Z]'

    # If we put two square bracket patterns together, we can express
    # any word that starts with two a's (either upper or lower case).
    '^[aA][aA]'

    # Alternatively, we can use {n} after a pattern to express that
    # the pattern should be repeated n times. This pattern will
    # therefore be exactly the same as the previous example.
    '^[aA]\{2\}'

    # Any number can be provided. This example finds words starting
    # with 3 a's (capital or lower case).
    '^[aA]\{3\}'

    # Parentheses actually do more than just signify order of operations.
    # They "capture" the characters that match them and you can refer to
    # the "capture group" later, just like you would use a variable. In
    # this example, if we use "\1", we match words that consist of
    # two repeated halves - like "tutu" or "papa". We call the "\1" a
    # "backreference".
    '^\([A-Za-z]*\)\1$'

    # We can use more than one set of parentheses to make multiple
    # capture groups, and then refer to them by numbered backreferences.
    # This example captures the first three characters of any type in
    # a word and then mirrors them backwards ("\3\2\1") via backreference
    # to represent 6-letter palindromes.
    '^\(.\)\(.\)\(.\)\3\2\1$'

    # 8-letter palindromes, but occurring anywhere in the word - we don't
    # have the line-start and line-end markers, so the whole word might
    # not be a palindrome. NOTE: you can't have more than 9 backreferences.
    '\(.\)\(.\)\(.\)\(.\)\4\3\2\1'

    # In this example, we use curly braces again, but provide two numbers
    # separated by a comma instead of one. {n,m} means that the previous
    # set of characters should be repeated at least n times (inclusive)
    # but not more than m times (inclusive). In this case, that means
    # we will find words with four or five vowels in a row.
    '[aeiou]\{4,5\}'

    # Another example: using capture groups and curly braces together to
    # produce three of the same vowel in a row:
    '\([aeiou]\)\1\1'
    '\([aeiou]\)\1\{2\}'  # either works

    # Alternatively we can put a ".*" in the capture group to signify
    # anything that might be around a vowel - this example will find any
    # word that has at least 10 vowels in it.
    '\([aeiou].*\)\{10\}'

    # Using capture groups and backreferences on the other hand, we can
    # find any word that has at least 7 of the SAME vowel.
    '\([aeiou]\).*\1.*\1.*\1.*\1.*\1.*\1'

    # Next, we introduce negation. If you put "^" first inside the square
    # brackets, it negates everything inside the brackets - so in this
    # case we want anything that isn't a vowel. This expression finds
    # all words that have no vowels at all.
    '^[^aeiouAEIOU]*$'

    # A few other special operators of interest. "+" indicates "one or
    # more", so we could also use that to find all words that have at
    # least one vowel in them.
    '[aeiouAEIOU]\+'

    # Finally, the question mark means "zero or one" match. So the following
    # regular expression will find words that start with "like", optionally
    # preceded by the prefix "un" (for example "likely" and "unlike").
    '^\(un\)\?like'
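
Several of the patterns above can be checked against a tiny stand-in for words.txt. This is a sketch, not the course's word list: the words are made up, and the escaped operators "\|" and "\?" as well as the -o flag assume GNU grep.

```shell
# Tiny stand-in for words.txt (hypothetical contents, one word per line).
words=$(mktemp)
printf '%s\n' quoting queueing squinting Apple apple aardvark NASA \
    papa tutu redder rhythm unlike likely > "$words"

loose=$(grep 'qu...ing' "$words")          # queueing, squinting
anchored=$(grep '^qu...ing$' "$words")     # queueing
any_middle=$(grep '^qu.*ing$' "$words")    # quoting, queueing

wrong_a=$(grep '^a\|A' "$words")           # also catches NASA
right_a=$(grep '^[aA]' "$words")           # Apple, apple, aardvark

halves=$(grep '^\([A-Za-z]*\)\1$' "$words")             # papa, tutu
palindromes=$(grep '^\(.\)\(.\)\(.\)\3\2\1$' "$words")  # redder
no_vowels=$(grep '^[^aeiouAEIOU]*$' "$words")           # rhythm
maybe_un=$(grep '^\(un\)\?like' "$words")               # unlike, likely

# Greediness: -o prints only the matched text, and .* grabs as much as it can.
greedy=$(printf '%s\n' 'bad luck, good luck' | grep -o 'l.*k')

printf '%s\n' "$wrong_a" "---" "$right_a" "---" "$greedy"
rm -f "$words"
```

The greedy match prints "luck, good luck" rather than stopping at the first "k", and the wrong_a result shows why the anchor needs parentheses or square brackets.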

Summary of special characters

    .         # one character of any kind
    ^         # "anchor" to the beginning of a line
    $         # "anchor" to the end of a line
    x*        # zero or more copies of x in a row
    .*        # zero or more characters of any kind
    \.        # a literal "." character
    ()        # "capture group", also used to express priority of operations
    x|y       # either match of x or match of y
    [aA]      # match a single character either "a" or "A"
    [a-z]     # match any character in the range of "a" to "z"
    x{n}      # match exactly n copies of x
    x{n,m}    # match at least n copies of x but no more than m
    ([a-z])\1 # use a "backreference" to refer to the "capture group"
    [^aA]     # match any character that is not "a" or "A"
    x+        # match one or more copies of x
    x?        # match zero or one copy of x
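
The summary writes the operators without backslashes, which is how "grep -E" expects them. Plain grep wants "()", "|", "{}", "+" and "?" escaped, as in the exercises. The sketch below runs a few patterns both ways on a made-up word list. It assumes GNU grep, since "\|" and "\?" are extensions in the basic syntax.

```shell
words=$(printf '%s\n' apple Apple banana aardvark unlike likely)

# Optional group: escaped in basic syntax, bare in extended syntax.
bre_opt=$(printf '%s\n' "$words" | grep '^\(un\)\?like')
ere_opt=$(printf '%s\n' "$words" | grep -E '^(un)?like')

# Repetition count.
bre_rep=$(printf '%s\n' "$words" | grep '^[aA]\{2\}')
ere_rep=$(printf '%s\n' "$words" | grep -E '^[aA]{2}')

# Alternation inside a group.
bre_alt=$(printf '%s\n' "$words" | grep '^\(b\|l\)')
ere_alt=$(printf '%s\n' "$words" | grep -E '^(b|l)')

printf '%s\n' "$ere_opt" "$ere_rep" "$ere_alt"
```

Each pair selects the same words, so the choice between the two styles is mostly about how many backslashes you want to type.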

Exercise

Write a regular expression that can be used with grep and that matches all formats of phone numbers:

    (206) 123-4567
    206 123 4567
    206.123.4567
    2061234567
    206-123-4567
    (206) 1234567

Answer:

    '(\?[0-9]\{3\})\?[- .]\?[0-9]\{3\}[- .]\?[0-9]\{4\}'
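
One way to check the answer is to feed it the six sample formats and count the matching lines. The two non-matching lines below are made up for contrast, and the "\?" operator assumes GNU grep.

```shell
pattern='(\?[0-9]\{3\})\?[- .]\?[0-9]\{3\}[- .]\?[0-9]\{4\}'

# The six formats from the exercise; grep -c counts the matching lines.
good=$(printf '%s\n' '(206) 123-4567' '206 123 4567' '206.123.4567' \
    '2061234567' '206-123-4567' '(206) 1234567' | grep -c "$pattern")

# Lines that should not match. grep exits with status 1 when nothing
# matches (it still prints 0), so "|| true" keeps the script going.
bad=$(printf '%s\n' '206-12-4567' 'call me maybe' | grep -c "$pattern" || true)

echo "matched $good of 6 sample numbers, $bad of 2 non-numbers"
```

The pattern is not anchored, so it will also match a phone number embedded in a longer line, and it does not require the parentheses to be balanced. Whether that matters depends on what you are searching.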