any hw1 questions?

performance review

what data to collect

data format

time format string

redirecting output

gnuplot

image recovery

handling multiple arguments

handling bad arguments

reminder: error messages to stderr

getting width, height

getting and formatting pixel values

computing pixel locations

output filename

regular expressions (regex)

a special language for describing patterns

when searching and modifying text, can find everything matching a pattern, not just a single literal string

useful for data validation (usernames, emails, passwords, etc.), data scraping, syntax highlighting, and more

supported in all sorts of places

including .NET languages (C#, F#), C, C++, Java, JavaScript, Python

lots of different flavors with variations in features and syntax — I will be talking about POSIX regular expressions

you will use them on hw2 to extract data from raw HTML

plain old strings are regular expressions

Aaron matches exactly one string: ‘Aaron’

character classes

putting characters inside [ ] means match exactly one of the characters inside

[Aa]aron matches both ‘Aaron’ and ‘aaron’

character classes can have ranges inside them, so [a-d] matches ‘a’ or ‘b’ or ‘c’ or ‘d’

[0-9] will match a single digit

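individual characters and ranges can all be thrown together: [-x0-9a-f] matches the characters that can appear in a lowercase hexadecimal literal such as -0x1f

order doesn’t matter inside [ ], with two exceptions: a - is literal only when it comes first or last, and a ^ is special only when it comes first

a ^ at the front of a character class negates it, so it matches any single character not in the class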

[^0-9] will match any non-digit character

exercise: write a character class that will match any alphanumeric character or an underscore (valid variable name characters)

[A-Za-z0-9_]
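
the character classes above can be tried out with a minimal sketch like this one. Python is an assumption here (the notes don't name a language), and its re module is Perl-style rather than POSIX, but simple character classes behave the same way. the sample strings are made up for illustration

```python
import re

# NOTE: Python's re module is an assumption; it is not a POSIX engine,
# but plain character classes work the same way.

# [Aa]aron: the class [Aa] matches exactly one character, 'A' or 'a'
name = re.compile(r"[Aa]aron")
print(bool(name.fullmatch("Aaron")))   # True
print(bool(name.fullmatch("aaron")))   # True
print(bool(name.fullmatch("Baron")))   # False

# [^0-9] matches any single character that is not a digit
non_digits = re.findall(r"[^0-9]", "a1b22c")
print(non_digits)                      # ['a', 'b', 'c']

# the exercise answer: one valid variable-name character
var_char = re.compile(r"[A-Za-z0-9_]")
valid = [c for c in "x_9-!" if var_char.fullmatch(c)]
print(valid)                           # ['x', '_', '9']
```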

special characters

. matches any character

\s matches whitespace

\w matches word characters (alphanumeric and underscore)

a number of built-in classes

including [:alnum:], [:alpha:], [:blank:], [:digit:], [:lower:], [:upper:], [:punct:], [:space:], [:xdigit:]

the [ ] are part of the name of the special class, so to match digits, you would use [[:digit:]]

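exercise: write a regex that matches 4-digit numbers written with a comma, a space, or a period between the thousands and hundreds digits

\. matches a literal period outside of [ ]; inside [ ] a period is already literal, and POSIX treats a backslash there as just another character

[0-9][, .][0-9][0-9][0-9]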
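
a quick way to check the exercise answer, sketched in Python (an assumption; Python's re module is Perl-style, not POSIX). the pattern is written with the period unescaped inside the class, and the sample inputs are made up

```python
import re

# NOTE: Python is an assumption here; the sample strings are illustrative.
# one digit, one separator (comma, space, or period), then three digits
four_digit = re.compile(r"[0-9][, .][0-9][0-9][0-9]")

samples = ["1,234", "1 234", "1.234", "1234", "1-234"]
matched = [s for s in samples if four_digit.fullmatch(s)]
print(matched)   # ['1,234', '1 234', '1.234']
```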

quantifiers

can repeat parts of your pattern

* (zero or more times)

+ (one or more times)

? (zero or one time)

{n} (exactly n times)

{n,} (at least n times)

{n,m} (at least n times but no more than m times)

[Aa]*aron matches

aron

Aaron

aaron

Aaaron

aAaron

aaaron

…

note that a pattern followed by * always succeeds, since matching zero times counts as a match
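
the quantifiers above, sketched in Python (an assumption; its re module is Perl-style, not POSIX, but these quantifiers behave the same). fullmatch requires the whole string to match, which makes it easy to see what [Aa]*aron does and does not accept

```python
import re

# NOTE: Python is an assumption; candidate strings are illustrative.
pattern = re.compile(r"[Aa]*aron")
candidates = ["aron", "Aaron", "aaron", "aAaron", "Baron", "Aarons"]
full = [s for s in candidates if pattern.fullmatch(s)]
print(full)   # ['aron', 'Aaron', 'aaron', 'aAaron']

# counted repetition: {3} is exactly three, {2,4} is two to four
three = bool(re.fullmatch(r"[0-9]{3}", "123"))
too_long = bool(re.fullmatch(r"[0-9]{2,4}", "12345"))
print(three, too_long)   # True False

# x* succeeds even when there is no x: it matches the empty string
empty = re.match(r"x*", "abc")
print(repr(empty.group()))   # ''
```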

grouping

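quantifiers apply only to the single item immediately to their left (one character, one class, or one group)

0abc+0 matches 0abc0, 0abcc0, 0abccc0, but not 0abcabc0, because the + applies only to the c

use ( ) to group characters together: 0(abc)+0 matches 0abc0 and 0abcabc0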
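
the difference between the grouped and ungrouped patterns, as a small Python sketch (Python is an assumption; grouping with ( ) works the same way in POSIX extended regexes)

```python
import re

# NOTE: Python is an assumption; the test strings come from the notes above.
ungrouped = re.compile(r"0abc+0")    # + repeats only the c
grouped = re.compile(r"0(abc)+0")    # + repeats the whole group abc

tests = ["0abc0", "0abccc0", "0abcabc0"]
ungrouped_hits = [s for s in tests if ungrouped.fullmatch(s)]
grouped_hits = [s for s in tests if grouped.fullmatch(s)]

print(ungrouped_hits)   # ['0abc0', '0abccc0']
print(grouped_hits)     # ['0abc0', '0abcabc0']
```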

you can also refer to a previously matched group later on in a regex

especially important for substitution — see next lecture

\n, where n is a number, refers to the text matched by the nth group

\b(\w+)\s+\1\b would match consecutive duplicate words

\b is the boundary of a word
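
a sketch of the duplicate-word pattern in Python (an assumption; \b, \w, and \s are shorthand that Python and GNU grep both accept). the sample sentences are made up. the second half shows why the \b anchors matter

```python
import re

# NOTE: Python is an assumption; the sample text is illustrative.
dup = re.compile(r"\b(\w+)\s+\1\b")

text = "this is is a test of the the regex"
repeats = [m.group(1) for m in dup.finditer(text)]
print(repeats)   # ['is', 'the']

# without \b, the "is" at the end of "this" pairs up with the next "is"
no_boundary = re.search(r"(\w+)\s+\1", "this is fine")
with_boundary = dup.search("this is fine")
print(no_boundary.group())   # 'is is'
print(with_boundary)         # None
```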

escaping

if you want to match a literal * or +, you put a \ in front of it, just like we matched a literal . with \.

same is true for any character with special meaning
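
escaping in action, sketched in Python (an assumption). re.escape is a Python helper, not something from these notes: it adds the backslashes for you when you want to match a string literally

```python
import re

# NOTE: Python is an assumption; re.escape is Python-specific.
# unescaped, + is a quantifier: 2+2 means "one or more 2s, then a 2"
plus_as_quantifier = bool(re.fullmatch(r"2+2", "2+2"))
many_twos = bool(re.fullmatch(r"2+2", "2222"))
print(plus_as_quantifier, many_twos)   # False True

# escaped, \+ is a literal plus sign
literal_plus = bool(re.fullmatch(r"2\+2", "2+2"))
print(literal_plus)                    # True

# re.escape puts a backslash in front of each special character
escaped = re.escape("a*b+c.")
print(escaped)
matches_itself = bool(re.fullmatch(escaped, "a*b+c."))
print(matches_itself)                  # True
```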

greediness

quantifiers are greedy, meaning they match as much text as they possibly can

on the string “Hello,” she said. “How are you?”, you might expect ".+" to match only “Hello,” and be surprised when it matches the entire string, from the first quotation mark all the way to the last
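
greediness can be seen directly with a short Python sketch (Python is an assumption, and the example uses straight quotes in place of the curly ones). one way around it that uses only features from these notes is a negated character class, which cannot run past the closing quote

```python
import re

# NOTE: Python is an assumption; straight quotes stand in for curly ones.
text = '"Hello," she said. "How are you?"'

# greedy: .+ grabs as much as it can, so the match ends at the LAST quote
greedy = re.search(r'".+"', text).group()
print(greedy == text)   # True

# [^"]+ can't cross a quote, so the match ends at the next quote
first_quote = re.search(r'"[^"]+"', text).group()
print(first_quote)      # "Hello,"
```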

anchors

^ at the start of a regex will force it to match starting at the beginning of a line

$ at the end of a regex will force it to match ending at the end of a line
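
a sketch of what the anchors change, in Python (an assumption). grep works one line at a time; here each list element plays the role of one line, and the sample lines are made up

```python
import re

# NOTE: Python is an assumption; each string stands in for one line of input.
digits_anywhere = re.compile(r"[0-9]+")    # digits somewhere in the line
digits_only = re.compile(r"^[0-9]+$")      # the whole line must be digits

lines = ["12345", "abc123", "123abc", ""]
anywhere = [s for s in lines if digits_anywhere.search(s)]
only = [s for s in lines if digits_only.search(s)]

print(anywhere)   # ['12345', 'abc123', '123abc']
print(only)       # ['12345']
```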

exercise: write a regex for phone numbers

it should match the following

123-456-7890

(123) 456-7890

123 456 7890

123.456.7890

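^\(?\d{3}\)?[\s\.-]\d{3}[\s\.-]\d{4}$

\d (a digit) and \s inside [ ] are Perl-style shorthand; a strictly POSIX version is ^\(?[0-9]{3}\)?[ .-][0-9]{3}[ .-][0-9]{4}$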
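
a sketch that checks the phone number pattern against the four examples, in Python (an assumption). it assumes the optional parentheses are written \(? and \)?, and the two rejected strings are made up for contrast

```python
import re

# NOTE: Python is an assumption; \(? and \)? are assumed to be the
# optional literal parentheses around the area code.
phone = re.compile(r"^\(?\d{3}\)?[\s\.-]\d{3}[\s\.-]\d{4}$")

should_match = [
    "123-456-7890",
    "(123) 456-7890",
    "123 456 7890",
    "123.456.7890",
]
should_fail = ["1234567890", "123-456-789"]   # illustrative non-matches

accepted = [s for s in should_match + should_fail if phone.search(s)]
print(accepted == should_match)   # True
```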

egrep examples

word games