Lecture 5 — regular expressions
any hw1 questions?
performance review
what data to collect
data format
time format string
redirecting output
image recovery
handling multiple arguments
handling bad arguments
reminder: error messages to stderr
getting width, height
getting and formatting pixel values
computing pixel locations
output filename
regular expressions (regex)
a special language for describing patterns
when searching and modifying text, can find everything matching a pattern, not just a single literal string
useful for data validation (usernames, emails, passwords, etc.), data scraping, syntax highlighting, and much more
supported in all sorts of places
including .NET languages (C#, F#), C, C++, Java, JavaScript, Python
lots of different flavors with variations in features and syntax — I will be talking about POSIX regular expressions
you will use them on hw2 to extract data from raw html
plain old strings are regular expressions
Aaron matches exactly one string: ‘Aaron’
character classes
putting characters inside [ ] means match exactly one of the characters inside
[Aa]aron matches both ‘Aaron’ and ‘aaron’
character classes can have ranges inside them, so [a-d] matches ‘a’ or ‘b’ or ‘c’ or ‘d’
[0-9] will match a single digit
individual characters and ranges can all be thrown together: [-x0-9a-f] matches characters in hexadecimal numbers
order doesn’t matter inside [ ], except that a literal - has to come first or last and ^ is only special at the front
including a ^ at the front of a character class matches any character not in the class
[^0-9] will match any non-digit character
exercise: write a character class that will match any alphanumeric character or an underscore (valid variable name characters)
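a minimal sketch of the character class rules above, using Python's re module as a stand-in. This is an assumption: Python's regex flavor is not POSIX, but plain bracket classes, ranges, and ^ negation behave the same way. The helper name matches_whole is made up for illustration.

```python
import re

def matches_whole(pattern, text):
    """Return True if pattern matches the entire text (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# [Aa] matches exactly one of 'A' or 'a'
print(matches_whole(r"[Aa]aron", "Aaron"))   # True
print(matches_whole(r"[Aa]aron", "aaron"))   # True
print(matches_whole(r"[Aa]aron", "Baron"))   # False

# a range: [a-d] is the same as [abcd]
print(matches_whole(r"[a-d]", "c"))          # True
print(matches_whole(r"[a-d]", "e"))          # False

# a leading ^ negates the class: [^0-9] is any single non-digit
print(matches_whole(r"[^0-9]", "x"))         # True
print(matches_whole(r"[^0-9]", "7"))         # False
```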
special characters
. matches any character
\s matches whitespace (not part of POSIX itself, but a common extension that GNU grep supports)
\w matches word characters (alphanumeric and underscore; also an extension)
a number of built-in classes
including [:alnum:], [:alpha:], [:blank:], [:digit:], [:lower:], [:upper:], [:punct:], [:space:], [:xdigit:]
the [ ] are part of the name of the special class, so to match digits, you would use [[:digit:]]
exercise: write a regex that matches 4-digit numbers with a comma, a space, or a period between the thousands and hundreds digits
\. matches a literal period
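[0-9][, .][0-9][0-9][0-9] (inside [ ] a period is already literal, so it needs no backslash; in POSIX a backslash inside [ ] is just another character in the class)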
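a rough sketch of the special characters and the thousands-separator pattern, again in Python as an assumption. Python's re module understands ., \s, and \w but not the POSIX names such as [[:digit:]], so the code spells digits as [0-9]. The name THOUSANDS is made up for illustration.

```python
import re

# . matches any single character (in Python, any except a newline by default)
print(bool(re.fullmatch(r"a.c", "a-c")))      # True

# \s matches one whitespace character, \w one word character
print(bool(re.fullmatch(r"\w\s\w", "a b")))   # True
print(bool(re.fullmatch(r"\w\s\w", "a_b")))   # False: '_' is a word character

# one digit, then a comma, space, or period, then three digits
THOUSANDS = r"[0-9][, .][0-9][0-9][0-9]"
for text in ["1,234", "1 234", "1.234", "1234", "1-234"]:
    # the first three match, the last two do not
    print(text, bool(re.fullmatch(THOUSANDS, text)))
```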
can repeat parts of your pattern
* (zero or more times)
+ (one or more times)
? (zero or one time)
{n} (exactly n times)
{n,} (at least n times)
{n,m} (at least n times but no more than m times)
[Aa]*aron matches ‘aron’, ‘Aaron’, ‘aaron’, ‘AAaron’, and so on
note that * always succeeds since any pattern can match zero times
quantifiers apply to the pattern immediately to the left
0abc+0 matches 0abc0, 0abcc0, 0abccc0, but not 0abcabc0
use ( ) to group characters together: 0(abc)+0
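the quantifier rules above can be checked with a short sketch. Python's re module is used as a stand-in for a POSIX tool (an assumption; these quantifiers behave the same way in both), and the helper name whole is made up.

```python
import re

def whole(pattern, text):
    """Return True if pattern matches the entire text (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

tests = ["0abc0", "0abccc0", "0abcabc0"]

# + applies only to the 'c' immediately to its left
results_plain = {s: whole(r"0abc+0", s) for s in tests}
print(results_plain)   # {'0abc0': True, '0abccc0': True, '0abcabc0': False}

# ( ) groups 'abc', so + repeats the whole group
results_group = {s: whole(r"0(abc)+0", s) for s in tests}
print(results_group)   # {'0abc0': True, '0abccc0': False, '0abcabc0': True}

# * can match zero times, ? matches zero or one time
print(whole(r"[Aa]*aron", "aron"))      # True
print(whole(r"[Aa]?aron", "AAaron"))    # False: at most one [Aa] allowed

# counted repetition
print(whole(r"[0-9]{3}", "123"))        # True
print(whole(r"[0-9]{2,}", "1"))         # False: needs at least 2 digits
print(whole(r"[0-9]{2,4}", "12345"))    # False: more than 4 digits
```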
you can also refer to a previously matched group later on in a regex
especially important for substitution — see next lecture
\n where n is a number refers to the nth group
\b(\w+)\s+\1\b would match consecutive duplicate words
\b is the boundary of a word
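a small sketch of the duplicate-word pattern above, run in Python as an assumption (its re module supports \1, \w, \s, and \b with the meanings given here). The sample sentence and the variable names are made up.

```python
import re

DUPLICATE_WORD = re.compile(r"\b(\w+)\s+\1\b")

text = "this is is a test of the the regex"

# group(0) is the whole match, group(1) is the word captured by ( )
duplicates = [m.group(0) for m in DUPLICATE_WORD.finditer(text)]
print(duplicates)   # ['is is', 'the the']

# without \b, the 'is' at the end of 'this' counts as a word
no_boundary = re.search(r"(\w+)\s+\1", "this is")
with_boundary = re.search(r"\b(\w+)\s+\1\b", "this is")
print(no_boundary.group(0))   # is is
print(with_boundary)          # None
```

the second half shows why the \b anchors matter: they stop the pattern from matching a word against the tail of a longer word.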
if you want to match a literal * or +, you put a \ in front of it, just like we matched a literal . with \.
same is true for any character with special meaning
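a quick sketch of escaping, using Python's re module as an assumption (the backslash rule is the same outside of [ ] in POSIX tools). The pattern names are made up for illustration.

```python
import re

UNESCAPED_PLUS = r"2+2"    # + is a quantifier: one or more '2', then '2'
LITERAL_PLUS = r"2\+2"     # \+ is a literal plus sign
LITERAL_STAR = r"3\*4"     # \* is a literal asterisk
LITERAL_DOT = r"3\.14"     # \. is a literal period

print(bool(re.fullmatch(UNESCAPED_PLUS, "2+2")))   # False
print(bool(re.fullmatch(UNESCAPED_PLUS, "222")))   # True
print(bool(re.fullmatch(LITERAL_PLUS, "2+2")))     # True
print(bool(re.fullmatch(LITERAL_STAR, "3*4")))     # True
print(bool(re.fullmatch(LITERAL_DOT, "3.14")))     # True
print(bool(re.fullmatch(LITERAL_DOT, "3x14")))     # False
```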
quantifiers are greedy, meaning they match as much text as they possibly can
On the string "Hello," she said. "How are you?", you might expect ".+" to match only "Hello," and be surprised when it matches the entire string: the .+ keeps going until the last " it can reach
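the greedy behavior can be seen in a short sketch, written in Python as an assumption (POSIX tools pick the same longest match here). The negated-class workaround is one option that uses only features covered so far; the variable names are made up.

```python
import re

text = '"Hello," she said. "How are you?"'

# greedy: .+ runs to the end, then backs up only as far as the last quote
greedy = re.search(r'".+"', text).group(0)
print(greedy)        # "Hello," she said. "How are you?"

# workaround: forbid quotes inside the match with a negated class
first_quote = re.search(r'"[^"]+"', text).group(0)
print(first_quote)   # "Hello,"
```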
^ at the start of a regex will force it to match starting at the beginning of a line
$ at the end of a regex will force it to match ending at the end of a line
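a minimal sketch of the two anchors, in Python as an assumption. Each string in the made-up list stands for one line of input, so ^ and $ line up with the start and end of the line the way they would in grep.

```python
import re

lines = ["Aaron", "Aaron said hi", "Dear Aaron"]   # made-up sample lines

anywhere = [s for s in lines if re.search(r"Aaron", s)]
at_start = [s for s in lines if re.search(r"^Aaron", s)]
at_end = [s for s in lines if re.search(r"Aaron$", s)]
whole_line = [s for s in lines if re.search(r"^Aaron$", s)]

print(anywhere)     # ['Aaron', 'Aaron said hi', 'Dear Aaron']
print(at_start)     # ['Aaron', 'Aaron said hi']
print(at_end)       # ['Aaron', 'Dear Aaron']
print(whole_line)   # ['Aaron']
```

using both anchors together forces the pattern to account for the entire line.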
exercise: phone number
matches the following
(123) 456-7890
123 456 7890
egrep examples
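egrep prints every input line that contains a match for an extended regular expression. A rough Python sketch of that behavior follows; it is not egrep itself, the function grep_lines and the sample lines are hypothetical, and Python's regex flavor differs from POSIX in some details (for example, it has no [[:digit:]]).

```python
import re

def grep_lines(pattern, lines):
    """Return the lines that contain a match for pattern (hypothetical helper)."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

sample = [
    "Aaron has 3 apples",
    "aaron has none",
    "no match here",
    "total: 1,234",
]

print(grep_lines(r"[Aa]aron", sample))   # ['Aaron has 3 apples', 'aaron has none']
print(grep_lines(r"[0-9]+", sample))     # ['Aaron has 3 apples', 'total: 1,234']
print(grep_lines(r"^[a-z]+ ", sample))   # ['aaron has none', 'no match here']
```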
