V Lecture 5 — regular expressions
V any hw1 questions?
V performance review
* what data to collect
V data format
* time format string
* redirecting output
* gnuplot
V image recovery
* handling multiple arguments
* handling bad arguments
* reminder: error messages to stderr
* getting width, height
* getting and formatting pixel values
* computing pixel locations
* output filename
V regular expressions (regex)
V a special language for describing patterns
V when searching and modifying text, can find everything matching a pattern, not just a single literal string
* useful for data validation (usernames, emails, passwords, etc.), data scraping, syntax highlighting, and more
V supported in all sorts of places
* including .NET languages (C#, F#), C, C++, Java, JavaScript, Python
* lots of different flavors with variations in features and syntax — I will be talking about POSIX regular expressions
* you will use them on hw2 to extract data from raw HTML
V plain old strings are regular expressions
* Aaron matches exactly one string: ‘Aaron’
V character classes
* putting characters inside [ ] means match exactly one of the characters inside
* [Aa]aron matches both ‘Aaron’ and ‘aaron’
V character classes can have ranges inside them, so [a-d] matches ‘a’ or ‘b’ or ‘c’ or ‘d’
* [0-9] will match a single digit
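* individual characters and ranges can all be thrown together: [-x0-9a-f] matches any one character of a hexadecimal number such as -0x1f (a - listed first is a literal hyphen, not a range)
* order doesn’t matter inside [ ], apart from a leading ^ and where a literal - is placed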
V putting a ^ at the front of a character class makes it match any one character not in the class
* [^0-9] will match any non-digit character
V exercise: write a character class that will match any alphanumeric character or an underscore (valid variable name characters)
* [A-Za-z0-9_]
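The character-class examples above can be tried out with a short script. This is a minimal sketch using Python's re module, which is not a POSIX flavor but treats these simple classes the same way; the sample strings and variable names are made up for illustration.

```python
import re

# [Aa]aron: the first character may be 'A' or 'a'
name = re.compile(r"[Aa]aron")
name_hits = [s for s in ["Aaron", "aaron", "Baron"] if name.fullmatch(s)]

# [^0-9] matches any single character that is not a digit
non_digits = re.findall(r"[^0-9]", "a1b22c")

# Exercise answer: one valid variable-name character
ident_char = re.compile(r"[A-Za-z0-9_]")
ident_hits = [c for c in "x_9-!" if ident_char.fullmatch(c)]

print(name_hits)   # ['Aaron', 'aaron']
print(non_digits)  # ['a', 'b', 'c']
print(ident_hits)  # ['x', '_', '9']
```

fullmatch is used so that the whole sample string has to fit the pattern, not just some part of it.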
V special characters
* . matches any single character
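* \s matches a whitespace character (an extension beyond strict POSIX, supported by GNU grep and most modern flavors)
* \w matches a word character (alphanumeric or underscore; also an extension beyond strict POSIX)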
V a number of built-in classes
* including [:alnum:], [:alpha:], [:blank:], [:digit:], [:lower:], [:upper:], [:punct:], [:space:], [:xdigit:]
* the [ ] are part of the name of the special class, so to match digits, you would use [[:digit:]]
V exercise: write a regex that matches 4-digit numbers that use a comma, a space, or a period between the thousands and hundreds digits
* \. matches a literal period
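* [0-9][, .][0-9][0-9][0-9] (inside [ ] a period is already literal, so it needs no backslash; in POSIX a backslash inside [ ] is itself a literal character)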
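A quick way to check the special characters and the exercise answer is sketched below. It assumes Python's re module, which supports . and \s and \w but does not support the POSIX [[:digit:]] style classes, so those are left out; the sample strings are invented for illustration.

```python
import re

# . matches any single character between 'a' and 'c'
dot_hits = re.findall(r"a.c", "abc a-c ac")

# \w matches one word character, \s matches one whitespace character
word_chars = re.findall(r"\w", "a_1!")
spaces = re.findall(r"\s", "a b\tc")

# Exercise answer: digit, then comma/space/period, then three digits.
# Inside [ ] a period is already literal.
thousands = re.compile(r"[0-9][, .][0-9][0-9][0-9]")
samples = ["1,234", "1 234", "1.234", "1234", "1-234"]
thousands_hits = [s for s in samples if thousands.fullmatch(s)]

print(dot_hits)        # ['abc', 'a-c']
print(word_chars)      # ['a', '_', '1']
print(spaces)          # [' ', '\t']
print(thousands_hits)  # ['1,234', '1 234', '1.234']
```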
V quantifiers
* can repeat parts of your pattern
* * (zero or more times)
* + (one or more times)
* ? (zero or one time)
* {n} (exactly n times)
* {n,} (at least n times)
* {n,m} (at least n times but no more than m times)
V [Aa]*aron matches
* aron
* Aaron
* aaron
* Aaaron
* aAaron
* aaaron
* note that a pattern followed by * always succeeds, since it is allowed to match zero times
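The quantifiers can be checked the same way. The sketch below assumes Python's re module; the candidate strings beyond those listed in the notes, and the colou?r and [0-9]{2,4} patterns, are illustrative additions.

```python
import re

# [Aa]* allows any number of leading 'A'/'a' characters, including none
pattern = re.compile(r"[Aa]*aron")
candidates = ["aron", "Aaron", "aaron", "Aaaron", "aAaron", "Baron", "Aaro"]
star_hits = [s for s in candidates if pattern.fullmatch(s)]

# ? makes the preceding character optional
optional_hits = [s for s in ["color", "colour", "colouur"]
                 if re.fullmatch(r"colou?r", s)]

# {2,4} means at least 2 and at most 4 repetitions
counted_hits = [s for s in ["7", "42", "1234", "12345"]
                if re.fullmatch(r"[0-9]{2,4}", s)]

# x* succeeds even when there is no 'x': it matches the empty string
empty_match = re.search(r"x*", "abc").group()

print(star_hits)          # ['aron', 'Aaron', 'aaron', 'Aaaron', 'aAaron']
print(optional_hits)      # ['color', 'colour']
print(counted_hits)       # ['42', '1234']
print(repr(empty_match))  # ''
```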
V grouping
V quantifiers apply to the pattern immediately to the left
* 0abc+0 matches 0abc0, 0abcc0, 0abccc0, not 0abcabc0
* use ( ) to group characters together: 0(abc)+0
V you can also refer to a previously matched group later on in a regex
* especially important for substitution — see next lecture
* \n where n is a number refers to the nth group
V \b(\w+)\s+\1\b would match consecutive duplicate words
* \b is the boundary of a word
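The difference grouping makes, and the duplicate-word backreference, can be seen in a small experiment. This is a sketch assuming Python's re module, where \1 and \b behave as described above; the sample sentence is invented for illustration.

```python
import re

ungrouped = re.compile(r"0abc+0")   # + applies to 'c' only
grouped = re.compile(r"0(abc)+0")   # + applies to the whole group 'abc'

samples = ["0abc0", "0abccc0", "0abcabc0"]
ungrouped_hits = [s for s in samples if ungrouped.fullmatch(s)]
grouped_hits = [s for s in samples if grouped.fullmatch(s)]

# \1 must repeat exactly what group 1 matched
dup = re.compile(r"\b(\w+)\s+\1\b")
text = "this is is a test of the the pattern"
dup_words = [m.group(1) for m in dup.finditer(text)]

print(ungrouped_hits)  # ['0abc0', '0abccc0']
print(grouped_hits)    # ['0abc0', '0abcabc0']
print(dup_words)       # ['is', 'the']
```

The leading \b keeps the pattern from pairing the "is" at the end of "this" with the following word.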
V escaping
* if you want to match a literal * or +, you put a \ in front of it, just like we matched a literal . with \.
* same is true for any character with special meaning
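A short comparison of escaped and unescaped special characters, sketched with Python's re module; the sample strings are made up for illustration.

```python
import re

# \+ is a literal plus sign rather than a quantifier
literal_plus = re.findall(r"[0-9]\+[0-9]", "1+2 and 3-4 and 5+6")

# \* is a literal asterisk
stars = re.findall(r"\*", "a*b*c")

# Unescaped . matches any character; \. matches only a period
loose = re.findall(r"3.14", "3.14 3x14")
strict = re.findall(r"3\.14", "3.14 3x14")

print(literal_plus)  # ['1+2', '5+6']
print(stars)         # ['*', '*']
print(loose)         # ['3.14', '3x14']
print(strict)        # ['3.14']
```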
V greediness
* quantifiers are greedy, meaning they match as much text as they possibly can
* on the string “Hello,” she said. “How are you?”, you might expect “.+” to match only “Hello,” and be surprised when it matches the entire string, from the first quotation mark to the last
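The greedy behavior is easy to reproduce. The sketch below assumes Python's re module and the curly quotation marks from the example; the second pattern is one possible workaround that uses only a negated character class from earlier in these notes, not necessarily the approach taken in lecture.

```python
import re

text = "“Hello,” she said. “How are you?”"

# Greedy: .+ runs to the end of the string, then backs up to the LAST ”
greedy = re.search(r"“.+”", text).group()

# Workaround: forbid the closing quote inside the match
quoted = re.findall(r"“[^”]+”", text)

print(greedy == text)  # True
print(quoted)          # ['“Hello,”', '“How are you?”']
```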
V anchors
* ^ at the start of a regex will force it to match starting at the beginning of a line
* $ at the end of a regex will force it to match ending at the end of a line
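The effect of the anchors can be compared against an unanchored search. This is a sketch assuming Python's re module, with each sample string standing in for one line of input; the strings are invented for illustration.

```python
import re

lines = ["error: disk full", "no error here", "fatal error"]

starts = [s for s in lines if re.search(r"^error", s)]   # must begin the line
ends = [s for s in lines if re.search(r"error$", s)]     # must end the line
anywhere = [s for s in lines if re.search(r"error", s)]  # no anchors

print(starts)    # ['error: disk full']
print(ends)      # ['fatal error']
print(anywhere)  # ['error: disk full', 'no error here', 'fatal error']
```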
V exercise: phone number
V matches the following
* 123-456-7890
* (123) 456-7890
* 123 456 7890
* 123.456.7890
* ^\(?[0-9]{3}\)?[ .-][0-9]{3}[ .-][0-9]{4}$ (Perl-style flavors let you write [0-9] as \d; POSIX and egrep do not support \d)
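The phone-number exercise can be checked against the four target formats. The sketch below assumes Python's re module and a version of the pattern written with [0-9] and [ .-] so that it also reads as a POSIX extended regex; the strings in the bad list are invented counterexamples.

```python
import re

phone = re.compile(r"^\(?[0-9]{3}\)?[ .-][0-9]{3}[ .-][0-9]{4}$")

good = ["123-456-7890", "(123) 456-7890", "123 456 7890", "123.456.7890"]
bad = ["1234567890", "123-456-789", "123-4567-890"]

good_hits = [s for s in good if phone.search(s)]
bad_hits = [s for s in bad if phone.search(s)]

print(good_hits == good)  # True
print(bad_hits)           # []
```

The pattern is deliberately loose: it also accepts mixed separators and an unbalanced parenthesis such as (123-456-7890, which a stricter pattern would have to rule out.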
V egrep examples
* word games