Due Tuesday 11/12 at 1:00 pm. No late submissions accepted.
Submission: Gradescope
Specification: Spec
This assignment focuses on using regular expressions with `grep`. The set of files required for this assignment - including what you'll submit - is available in the file `hw6.zip`:

wget https://courses.cs.washington.edu/courses/cse391/24au/homework/hw6/hw6.zip
unzip hw6.zip
You will submit three files to Gradescope: `task1.sh`, `task2.sh`, and `task3.sh`. For each task, determine a single shell statement that will perform the requested operation(s), not the output of the command. While each statement must be one line, you may use input/output redirection operators such as `>`, `<`, and `|`. Please be sure to write the entire statement in your solution, including any relevant input files.
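For instance, a single statement may chain commands with pipes and redirect the result to a file. The file names and pattern below are made up for illustration; they are not answers to any problem in this assignment:

```shell
# Illustrative only: create a throwaway input file
printf 'banana\ncherry\napple\n' > fruits.txt
# One statement: filter lines containing "a", sort them, save to out.txt
grep 'a' fruits.txt | sort > out.txt
cat out.txt
```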
To calculate your final score on this assignment, sum the individual scores from Gradescope.

- if your score is between [0, 1), you will get 0 points
- if your score is between [1.0, 1.5), you will get 1 point
- if your score is between [1.5, 2], you will get 2 points
Task 1: `grep -E` warmup

First, we'll do some warmup exercises using `grep` and regular expressions. You may find it helpful to refer to the Regex Syntax on our reference page.
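As a quick refresher on the syntax (this sample file and pattern are made up, and are not a solution to any problem below):

```shell
# Illustrative only: sample.txt is a throwaway file, not from hw6.zip
printf 'alpha\nBravo\ncharlie7\n' > sample.txt
# -E enables extended regular expressions; this pattern matches lines
# that begin with an uppercase letter
grep -E '^[A-Z]' sample.txt
```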
Problem 1

Write the command that matches all lines from `names.txt` that contain at least one numeric character.
Problem 2

Write the command that matches all lines from `names.txt` that are exactly 4 characters long and consist only of uppercase or lowercase characters.
Problem 3

Write the command that matches lines from `names.txt` that "look like" a first and last name. For the purposes of this problem, both a first and last name should start with an uppercase letter, followed by one or more lowercase letters; first and last names should be separated by a single space. Your command should not match lines that do not exactly follow this format.
Note!
This problem is intentionally flawed: writing a regular expression to capture all possible human names is a difficult task, if not impossible. However, many pieces of software do use regular expressions like these to “validate” names. It’s worth thinking about, especially as you’ll encounter this in the future (either as a user or developer of software).
To quote Patrick McKenzie’s original post on this topic: “I have never seen a computer system which handles names properly and doubt one exists, anywhere.”
Task 2: Validating input with regexes
After a few weeks at FAANG, management has discovered that we need to start actually selling products to stay in business! You’ve been tasked with spinning up our customer account creation and billing team. It turns out that regular expressions are extremely helpful in validating user input - so we’ll give them a spin!
Problem 4: Username Validation

FAANG users sign up for accounts with usernames. After talking with the rest of your team, you've decided that usernames should:

- be at least three characters
- only contain (lowercase or uppercase) letters, digits, `.`, `-`, and `_`

Write a `grep` command that matches all (and only all) lines in `usernames.txt` that are valid usernames.
Tip

To match the literal character `-` in a character set, place it last, e.g. `[abcde-]`. Escaping it with `\-` does not work!
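A small illustration of that tip, on made-up input unrelated to `usernames.txt`:

```shell
# '-' placed last in the character class is treated literally,
# so the line "a-b" matches; -c prints the count of matching lines
printf 'a-b\n' | grep -cE '^[a-z-]+$'
```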
Problem 5: Email Validation

In addition to a username, customers must also provide a valid email address. You've heard that email address validation is actually very complicated, but the spec seems too long, so you've decided to use a simplified version of a valid email address.

For the purposes of FAANG, a valid email address contains each of the following elements, in order:

- between 1 and 16 characters, each a lowercase letter, uppercase letter, or digit
- then, the `@` symbol
- then, a domain (e.g. `gmail`) that consists of at least one lowercase letter
- then, a period (i.e. `.`)
- finally, a "top-level domain" or TLD (e.g. `com`), that consists of 2 or more lowercase letters

Write a `grep` command that matches all (and only all) lines in `emails.txt` that are valid emails.
Problem 6: Password Validation

We take security very seriously at FAANG. So, customers need to choose a strong password when creating an account. FAANG's security experts have decided that a strong password must:

- be at least 12 characters long
- contain at least one uppercase character
- contain at least one lowercase character
- contain at least one digit

However, passwords may contain other characters as well.

Write a `grep` command that matches all (and only all) lines in `passwords.txt` that are strong passwords.
Tip
This is a very simplistic view of what a strong password actually is. Take a security class to learn more!
Problem 7: Credit Card Validation

To process payments, FAANG needs to validate credit cards entered on the site. As a startup, we only accept Visa and Mastercard (Amex's fees are too high!!). Visa and Mastercard have different credit card formats:

- Mastercard
  - begins with a `5`
  - exactly 16 digits long, grouped into sets of 4 digits separated by a space
- Visa
  - begins with a `4`
  - between 13 and 16 digits long (inclusive), grouped into sets of 4 digits (the last group may be shorter) separated by a space

Write a `grep` command that matches all (and only all) lines in `cards.txt` that are valid Mastercard or Visa credit card numbers.
Fun Fact
Real-life credit card validation also involves a formula that helps prevent accidental errors; you do not need to do that for this assignment.
Problem 8: URL Validation

FAANG also lets users add personal websites to their profile page (why? who knows?). But, they want to make sure that these links work.

The spec for URLs also seems too complicated, so you've once again opted for a simple version. For FAANG users, a valid URL:

- optionally starts with either `http://` or `https://`
- then, has at least one domain-`.` pair (e.g. `cs.uw.` or `google.`), where a domain-`.` pair has:
  - a domain, which is one or more lowercase letters
  - then, a period (i.e. `.`)
- then, a "top-level domain" or TLD (e.g. `com`), that consists of 2 or more lowercase letters
- optionally, a trailing `/`

Write a `grep` command that matches all (and only all) lines in `urls.txt` that are valid URLs.
Tip
You can put all sorts of complicated things in capture groups…
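For instance, a group can take a repetition quantifier (this pattern and input are illustrative only, not part of the URL problem):

```shell
# A capture group with a repetition quantifier: one or more "ab" pairs
# and nothing else on the line
printf 'abab\nabc\n' | grep -E '^(ab)+$'
```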
Task 3: Parsing CSE Web Logs with `grep`

In this task, you'll use `grep` to parse an IP-anonymized snapshot of the CSE course web logs (from 4:00-5:59 PM on July 22nd). This is intended to model how software engineers use tools like `grep` and regular expressions to filter through large amounts of data.
Each line in the file `common_log` represents one request to a webpage on the CSE server. Roughly, its format looks like this:

[TIMESTAMP] [REQUEST] [STATUS] [SIZE] [REFERRER] [USER AGENT] [SERVER] - [N]

Generally, each `[]` item is separated by a space (and values with a space in them are quoted). The `[SIZE]` and `[REFERRER]` can be `-` or `"-"` when the field is missing.

You won't have to worry about most of these fields; for this task, we will focus on the `[STATUS]` and `[USER AGENT]`.
As an example, consider the following line (indented for clarity):
[22/Jul/2024:14:12:33 -0700]
"GET /courses/cse391/24au/css/base.css HTTP/1.1" 200 159760
"https://courses.cs.washington.edu/courses/cse391/24au/"
"Mozilla/5.0 ... Safari/605.1.15"
courses.cs.washington.edu:443 - 6
- the `[TIMESTAMP]` is `[22/Jul/2024:14:12:33 -0700]`
- the `[STATUS]` is `200`
- the `[USER AGENT]` is `"Mozilla/5.0 ... Safari/605.1.15"`
Problem 9: Serving Statuses

A status code of `200` means that the request was successful. Write a single-line shell statement that only outputs entries in `common_log` that contain the number `200`.

Hint: you should get `8874` entries.
Problem 10: Serving Statuses, Specifically

Oh no! You realized that the number `200` appears elsewhere, like in file paths that contain years (e.g. `2007`). Write a new single-line shell statement that only outputs entries in `common_log` that contain the status code of `200`.

Hints:

- take advantage of the fact that you know the format of the log. How does this help you?
- you should get `8848` entries.
Problem 11: Searching Salacious Scrapers

Web servers (such as CSE's) are frequently hit by web scrapers; this is even more common with generative AI's popularity. For example, the log contains many requests from crawlers made by Google, OpenAI, and Bing. CSE IT is interested in seeing how many times these scrapers (or "bots") ping our webserver.

Unfortunately, `grep`ing for just `bot` doesn't work: `bot` is a substring in other file paths.

After doing some research, you've learned that crawlers are supposed to identify themselves as a bot in the `[USER AGENT]` field. In particular, each bot has a URL between a `+http` and a `)` in the user agent field, with the text "bot" somewhere.

For example, see the `+https://openai.com/gptbot)` in the following user agent string:

"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)"

Write a single-line shell statement that outputs all entries with a `[USER AGENT]` matching this exact pattern (i.e. containing a `+http`, some characters, and then a closing `)`).
Hints:

- the negation meaning of `^` is your friend :)
- to help you double-check your work, you should be getting `3057` entries.
Problem 12: Summing Scrapers

IT wants to research each of the individual bots that are scraping the CSE website. In particular, they want to know how many different bots are hitting the server (regardless of which page they requested).

For this problem, assume that bots are unique to their URL (the text between the `+` and `)` in the previous problem).

Write a single-line shell statement that outputs the number of unique bots that have made requests to the CSE servers.

Three hints:

- the `-o` flag of `grep` outputs only the text that was matched.
- remember what we saw in previous HWs on using `sort`, `uniq`, and `wc` together - both with respect to ordering, and building the solution up gradually!
- the number should be much, much lower than `3057`!
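As a generic refresher on that pipeline (using made-up words rather than the log file, and leaving the rest of the problem to you):

```shell
# sort groups duplicate lines together so uniq can collapse them,
# then wc -l counts the remaining distinct lines
printf 'cat\ndog\ncat\nbird\n' | sort | uniq | wc -l
```

The ordering matters here: `uniq` only collapses *adjacent* duplicates, which is why `sort` must come first.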
(as an aside: all of these links should lead to valid websites. Do you recognize these companies?)