Lecture 5 — regular expressions
any hw1 questions?
performance review
what data to collect
data format
time format string
redirecting output
gnuplot
image recovery
handling multiple arguments
handling bad arguments
reminder: error messages to stderr
getting width, height
getting and formatting pixel values
computing pixel locations
output filename
regular expressions (regex)
a special language for describing patterns
when searching and modifying text, you can find everything matching a pattern, not just a single literal string
useful for data validation (usernames, emails, passwords, etc.), data scraping, syntax highlighting, and much more
supported in all sorts of places
including .NET languages (C#, F#), C, C++, Java, JavaScript, Python
lots of different flavors with variations in features and syntax — I will be talking about POSIX regular expressions
you will use them on hw2 to extract data from raw html
plain old strings are regular expressions
Aaron matches exactly one string: ‘Aaron’
character classes
putting characters inside [ ] means match exactly one of the characters inside
[Aa]aron matches both ‘Aaron’ and ‘aaron’
character classes can have ranges inside them, so [a-d] matches ‘a’ or ‘b’ or ‘c’ or ‘d’
[0-9] will match a single digit
individual characters and ranges can all be thrown together: [-x0-9a-f] matches the characters of a lowercase hexadecimal literal such as -0x1f (a - at the very start of a class is taken literally)
order doesn’t matter inside [ ]
including a ^ at the front of a character class matches any character not in the class
[^0-9] will match any non-digit character
exercise: write a character class that will match any alphanumeric character or an underscore (valid variable name characters)
[A-Za-z0-9_]
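The character classes above can be tried out with a short sketch. It assumes Python's `re` module, which is a different flavor from POSIX but treats these particular classes the same way; the sample strings are made up for illustration.

```python
import re

# Patterns taken from the notes above.
name_pattern = re.compile(r"[Aa]aron")
hex_chars = re.compile(r"[-x0-9a-f]")
non_digit = re.compile(r"[^0-9]")
identifier_char = re.compile(r"[A-Za-z0-9_]")

# fullmatch requires the whole string to match the pattern.
aaron_results = [bool(name_pattern.fullmatch(s)) for s in ["Aaron", "aaron", "AAron"]]

# Every character of a (hypothetical) hex literal is in the hex class.
hex_ok = all(hex_chars.fullmatch(c) for c in "-0x1f")

# findall returns every non-overlapping match, here one character at a time.
non_digits = non_digit.findall("a1b2")
identifier_chars = identifier_char.findall("my_var-2!")

print(aaron_results, hex_ok, non_digits, identifier_chars)
```

Note that a class matches exactly one character, so `[Aa]aron` rejects `AAron`: the second letter must be a lowercase `a`.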
special characters
. matches any single character
\s matches whitespace (a GNU extension rather than strict POSIX; the POSIX spelling is [[:space:]])
\w matches word characters (alphanumeric and underscore; also a GNU extension, equivalent to [[:alnum:]_])
a number of built-in classes
including [:alnum:], [:alpha:], [:blank:], [:digit:], [:lower:], [:upper:], [:punct:], [:space:], [:xdigit:]
the [ ] are part of the name of the special class, so to match digits, you would use [[:digit:]]
exercise: write a regex that matches 4 digit numbers that uses a comma, a space, or a period between the thousands and hundreds digits
\. matches a literal period
[0-9][, .][0-9][0-9][0-9] (inside [ ] a period is already literal, so it needs no backslash there)
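A quick check of the exercise answer, sketched with Python's `re` module as an assumption (the notes target POSIX tools). The sample inputs are invented for illustration.

```python
import re

# One digit, then a comma, space, or period, then three digits.
# A period needs no backslash inside [ ], so [, .] is enough.
four_digit = re.compile(r"[0-9][, .][0-9][0-9][0-9]")

samples = ["1,234", "1 234", "1.234", "1234", "1-234"]

# fullmatch checks that the entire sample fits the pattern.
separator_results = [bool(four_digit.fullmatch(s)) for s in samples]

print(dict(zip(samples, separator_results)))
```

The last two samples fail because the separator is required and a hyphen is not one of the three allowed separators.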
quantifiers
can repeat parts of your pattern
* (zero or more times)
+ (one or more times)
? (zero or one time)
{n} (exactly n times)
{n,} (at least n times)
{n,m} (at least n times but no more than m times)
[Aa]*aron matches
aron
Aaron
aaron
Aaaron
aAaron
aaaron
…
note that * always succeeds since any pattern can match zero times
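The quantifier behavior listed above can be sketched as follows, assuming Python's `re` module. The `[0-9]{2,4}` pattern and the `baron` and `x*` samples are hypothetical additions, not from the lecture.

```python
import re

# [Aa]* allows any run of A/a (including none) before the literal "aron".
star = re.compile(r"[Aa]*aron")
samples = ["aron", "Aaron", "aaron", "Aaaron", "aAaron", "aaaron", "baron"]
star_results = [bool(star.fullmatch(s)) for s in samples]

# {n,m}: at least 2 but no more than 4 digits (hypothetical example).
counted = re.compile(r"[0-9]{2,4}")
counted_results = [bool(counted.fullmatch(s)) for s in ["7", "42", "12345"]]

# * can match zero times, so this search succeeds with an empty match
# even though "abc" contains no "x" at all.
empty_match = re.search(r"x*", "abc").group()

print(star_results, counted_results, repr(empty_match))
```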
grouping
quantifiers apply to the pattern immediately to the left
0abc+0 matches 0abc0, 0abcc0, 0abccc0, not 0abcabc0
use ( ) to group characters together: 0(abc)+0
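The difference between the two patterns can be seen side by side in a small sketch (Python's `re` module is an assumption here; the sample strings come from the notes).

```python
import re

# Without a group, + applies only to the "c".
ungrouped = re.compile(r"0abc+0")
# With a group, + applies to the whole "abc".
grouped = re.compile(r"0(abc)+0")

samples = ["0abc0", "0abcc0", "0abcabc0"]

ungrouped_results = [bool(ungrouped.fullmatch(s)) for s in samples]
grouped_results = [bool(grouped.fullmatch(s)) for s in samples]

print(ungrouped_results)
print(grouped_results)
```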
you can also refer to a previously matched group later on in a regex
especially important for substitution — see next lecture
\n where n is a number refers to the nth group
\b(\w+)\s+\1\b would match consecutive duplicate words
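\b matches at the boundary of a word (the position where a word character meets a non-word character or the edge of the text); it consumes no characters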
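A minimal sketch of the duplicate-word pattern, assuming Python's `re` module, which happens to accept this pattern unchanged. The sample sentences are invented.

```python
import re

# (\w+) captures a word; \1 must then repeat exactly that captured text.
duplicate_word = re.compile(r"\b(\w+)\s+\1\b")

text = "this is is a test of the the regex"

# With one group in the pattern, findall returns the captured word
# for each match rather than the whole matched text.
duplicates = duplicate_word.findall(text)

# The trailing \b stops "the then" from counting as a repeat of "the".
partial_repeat = duplicate_word.search("the then")

print(duplicates, partial_repeat)
```

The leading `\b` matters too: without it, the `is` at the end of `this` followed by ` is` would look like a duplicate.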
escaping
if you want to match a literal * or +, you put a \ in front of it, just like we matched a literal . with \.
same is true for any character with special meaning
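Escaping can be sketched like this, assuming Python's `re` module. `re.escape` is a Python convenience that adds the backslashes for you; it is not part of POSIX tools, and the sample strings are hypothetical.

```python
import re

# Unescaped, + is a quantifier: "2+2" means one or more 2s, then a 2.
as_quantifier = bool(re.fullmatch(r"2+2", "2+2"))
repeated_twos = bool(re.fullmatch(r"2+2", "222"))

# Escaped, \+ is a literal plus sign.
as_literal = bool(re.fullmatch(r"2\+2", "2+2"))

# re.escape backslash-escapes every special character in a string,
# producing a pattern that matches that string literally.
literal_pattern = re.escape("a.b*c")
escaped_exact = bool(re.fullmatch(literal_pattern, "a.b*c"))
escaped_other = bool(re.fullmatch(literal_pattern, "aXbbc"))

# Without escaping, . and * keep their special meanings.
unescaped_other = bool(re.fullmatch(r"a.b*c", "aXbbc"))

print(as_quantifier, repeated_twos, as_literal)
print(escaped_exact, escaped_other, unescaped_other)
```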
greediness
quantifiers are greedy, meaning they match as much text as they possibly can
on the string “Hello,” she said. “How are you?”, you might expect ".+" to match only “Hello,” and be surprised when it matches the entire string: .+ runs on to the last quotation mark it can reach
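The greedy match can be reproduced in a short sketch. It assumes Python's `re` module and uses straight double quotes in place of the curly ones. The negated-class pattern is one common workaround added here for illustration; it is not from the notes.

```python
import re

text = '"Hello," she said. "How are you?"'

# Greedy: .+ grabs as much as it can, so the match runs from the
# first quotation mark to the last one in the string.
greedy = re.search(r'".+"', text).group()

# Workaround: forbid quotation marks inside the match, so each
# match has to stop at the next closing quote.
quoted_parts = re.findall(r'"[^"]+"', text)

print(greedy)
print(quoted_parts)
```

The negated class works in POSIX regular expressions as well, since it needs no special non-greedy syntax.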
anchors
^ at the start of a regex will force it to match starting at the beginning of a line
$ at the end of a regex will force it to match ending at the end of a line
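A small sketch of the effect of anchors, assuming Python's `re` module and treating each sample string as one line of input (as grep would). The digit pattern and the samples are hypothetical.

```python
import re

unanchored = re.compile(r"[0-9]+")
anchored = re.compile(r"^[0-9]+$")

lines = ["12345", "abc123", "123abc"]

# search looks for a match anywhere in the line.
unanchored_results = [bool(unanchored.search(s)) for s in lines]

# With ^ and $ the digits must span the whole line.
anchored_results = [bool(anchored.search(s)) for s in lines]

print(unanchored_results)
print(anchored_results)
```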
exercise: phone number
matches the following
123-456-7890
(123) 456-7890
123 456 7890
123.456.7890
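^\(?\d{3}\)?[\s\.-]\d{3}[\s\.-]\d{4}$ (Perl-style syntax; in strict POSIX there is no \d and a backslash is literal inside [ ], so there you would write ^\(?[0-9]{3}\)?[ .-][0-9]{3}[ .-][0-9]{4}$)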
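The phone-number answer can be checked against the four target formats with a sketch like this one. It assumes Python's `re` module, which accepts the `\d` and `\s` shorthands; the two extra samples are invented to show the pattern's limits.

```python
import re

# Optional "(", three digits, optional ")", a separator, three digits,
# a separator, four digits, anchored to the whole line.
phone = re.compile(r"^\(?\d{3}\)?[\s\.-]\d{3}[\s\.-]\d{4}$")

targets = ["123-456-7890", "(123) 456-7890", "123 456 7890", "123.456.7890"]
target_results = [bool(phone.search(s)) for s in targets]

# A separator is required, so a bare run of ten digits is rejected.
no_separators = bool(phone.search("1234567890"))

# The two parentheses are optional independently of each other,
# so an unbalanced "(" is accepted by this pattern.
unbalanced = bool(phone.search("(123-456-7890"))

print(target_results, no_separators, unbalanced)
```

Rejecting unbalanced parentheses would need alternation between a parenthesized and a bare area code, which this pattern does not attempt.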
egrep examples
word games