Homework 6
Regular expressions
This assignment involves writing shell statements with regular expressions that precisely match (and only match) the requested lines. Your answers may also utilize input/output redirection operators such as >, <, and |. For this assignment, unless otherwise specified, “letters” should match both lowercase and uppercase letters.
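As a quick refresher on the redirection operators mentioned above, here is a minimal sketch. The file fruits.txt and its contents are hypothetical and are not part of the homework files.

```shell
# Hypothetical sample data, written to a temporary directory.
tmpdir=$(mktemp -d)

# > writes a command's standard output to a file.
printf 'apple\nbanana\ncherry\n' > "$tmpdir/fruits.txt"

# < feeds a file to a command's standard input.
grep 'an' < "$tmpdir/fruits.txt"

# | sends one command's output to the next command's input.
grep 'a' "$tmpdir/fruits.txt" | wc -l
```

The first grep reads the file through standard input, while the second passes its matching lines on to wc -l, which counts them.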
Download the homework files on attu
After logging into attu, download the homework files.
git archive --remote=git@gitlab.cs.washington.edu:cse391/25wi/hw6.git --prefix=hw6/ HEAD | tar -x
Task 1: Grep with regexes
Let’s practice writing grep commands involving regular expressions. Write your answers on the indicated lines in the task1.sh file in the hw6 folder.
- What is the grep command that matches all lines from names.txt that contain at least one numeric character?
- What is the grep command that matches all lines from names.txt that are exactly 4 characters long and consist only of uppercase or lowercase letters?
- What is the grep command that matches all lines from names.txt that look like a first and last name: two words separated by a single space, where each word begins with an uppercase letter followed by one or more lowercase letters?
This last problem is intentionally flawed: writing a regular expression to capture all possible human names is a difficult if not impossible task. However, since many real-world systems use regular expressions like these to “validate” names (and all sorts of other personal information), it’s worth thinking about which assumptions we want to make! To quote Patrick McKenzie’s original post on this topic:
I have never seen a computer system which handles names properly and doubt one exists, anywhere.
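When a pattern is meant to validate an entire line, as in the name problem above, the anchors ^ and $ decide whether the pattern must cover the whole line or may match anywhere inside it. The sketch below illustrates the difference on hypothetical input (lowercase hexadecimal strings, unrelated to names.txt); the LC_ALL=C setting is an assumption made so that the range a-f is interpreted by byte value.

```shell
export LC_ALL=C   # interpret ranges such as a-f by byte value

# Hypothetical input lines, not taken from the homework files.
sample=$(printf 'cafe\nCAFE\ncafe latte\n12ab\n')

# Unanchored: any line containing at least one lowercase hex character.
printf '%s\n' "$sample" | grep -E '[0-9a-f]+'

# Anchored: the whole line must consist of lowercase hex characters.
printf '%s\n' "$sample" | grep -E '^[0-9a-f]+$'

# Anchored and case-insensitive (-i): uppercase hex characters are accepted too.
printf '%s\n' "$sample" | grep -E -i '^[0-9a-f]+$'
```

The unanchored pattern accepts "cafe latte" because part of the line matches, while the anchored pattern rejects it because the space and the letters l and t fall outside the character set.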
Task 2: Validating input with regexes
After a few weeks at FAANG, management has discovered that we need to start actually selling products to stay in business! You’ve been tasked with spinning up our customer account creation and billing team. Write your answers on the indicated lines in the task2.sh file in the hw6 folder.
- What is the grep command that matches all the valid usernames in usernames.txt, where a username is at least 3 letters, digits, periods (.), hyphens (-), or underscores (_)? To match the literal character - in a character set, place it last, as in [abcde-]. Escaping it with \- does not work!
- What is the grep command that matches all the valid emails in emails.txt? Validating real email addresses is quite complicated, so a valid email address:
  - starts with between 1 and 16 letters or digits,
  - followed by the @ symbol,
  - followed by a domain like uw that consists of at least one lowercase letter,
  - followed by a period ., and
  - ending in a top-level domain like edu that consists of 2 or more lowercase letters.
- What is the grep command that matches all the strong passwords in passwords.txt, where a strong password contains:
  - at least 12 characters,
  - at least one uppercase character,
  - at least one lowercase character,
  - at least one digit, and
  - any other characters beyond these requirements.
- What is the grep command that matches all rewards card numbers in cards.txt? Whereas credit card validation will check the Luhn sum, our rewards card numbers match one of two patterns:
  - any 16-digit number beginning with a 5 and grouped into sets of 4 digits separated by a space, or
  - any 13-to-16-digit number beginning with a 4 and grouped into sets of 4 digits (where the last group may have fewer than 4 digits) separated by a space.
- What is the grep command that matches all valid URLs in urls.txt? Validating real URLs is quite complicated, so a valid URL:
  - optionally starts with either http:// or https://,
  - followed by at least one domain-. pair of one or more lowercase letters followed by a ., as in cs.uw. or google.,
  - followed by a top-level domain like edu that consists of 2 or more lowercase letters.
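The note in Task 2 about placing - last in a character set can be checked with a small experiment. The sketch below uses hypothetical tokens rather than the homework files, and it assumes POSIX bracket-expression rules, under which a backslash inside [...] is an ordinary character instead of an escape.

```shell
export LC_ALL=C   # interpret bracket expressions by byte value

# Hypothetical tokens: one with a hyphen, one plain, one with a backslash.
tokens=$(printf 'a-b\nabc\na\\b\n')

# Hyphen placed last: the set is a, b, c, and a literal "-".
printf '%s\n' "$tokens" | grep -E '^[abc-]+$'

# "Escaped" hyphen: the backslash itself joins the set,
# so the token containing a backslash now matches as well.
printf '%s\n' "$tokens" | grep -E '^[abc\-]+$'
```

Under that assumption, the second pattern accepts more than intended, which is one way to see why the escape does not behave as an escape here.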
Task 3: Parsing CSE Web Logs with grep
Let’s use grep to parse an anonymized snapshot of the CSE course webserver logs. This is intended to model how we can use tools like grep and regular expressions to filter large amounts of data. Each line in the file common_log represents one request to a webpage on the CSE server in the following format.
[TIMESTAMP] [REQUEST] [STATUS] [SIZE] [REFERRER] [USER AGENT] [SERVER] - [N]
Generally, each [] item is separated by a space, and values that contain a space will be quoted. The [SIZE] and [REFERRER] can be - or "-" when the field is missing.
You won’t have to worry about most of these fields: for this task, we will focus on the [STATUS] and [USER AGENT]. Consider the following line, which has been reformatted and nicely-indented for clarity:
[04/Feb/2025:01:31:55 -0800]
"GET /courses/cse391/24su/css/base.css HTTP/1.1" 200 159760
"https://courses.cs.washington.edu/courses/cse391/25wi/"
"Mozilla/5.0 ... Safari/605.1.15"
courses.cs.washington.edu:443 - 6
- [TIMESTAMP]: [04/Feb/2025:01:31:55 -0800]
- [STATUS]: 200, an integer code used to signal (in this case) that the request was successfully served to the website visitor.
- [USER AGENT]: "Mozilla/5.0 ... Safari/605.1.15", indicating the website visitor’s browser platform.
Write your answers on the indicated lines in the task3.sh file in the hw6 folder.
- A status code of 200 means that the request was successful. What is the shell statement that only outputs entries in common_log containing the number 200?
- Searching for 200 will result in an overestimate, since text in other columns can also trigger a match, such as a year number in a file path (like 2007). What is the shell statement that only outputs entries in common_log that contain the status code 200?
- Web crawlers (“bots”) identify themselves using a very particular user agent. What is the shell statement that outputs all entries with a user agent that contains the characters +http, any other characters, the characters bot, any other characters aside from ), and then a closing )? For example, it should match +https://openai.com/gptbot) in the following user agent: "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)"
- How many unique bots are visiting the server (regardless of which page they requested)? Assuming bots are uniquely identified by the user agent rule that we described above (the text between the + and )), what is the shell statement that outputs the number of unique bots that have made requests to the CSE servers? The grep -o flag may be of help to output only the matching text.
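To show how grep -o can feed a counting pipeline, here is a minimal sketch on hypothetical data. The id=... tokens and the pattern are invented for illustration and are unrelated to the user agents in common_log.

```shell
# Hypothetical log lines, not the CSE web logs.
lines=$(printf 'req id=7 ok\nreq id=12 ok\nreq id=7 fail\nno identifier here\n')

# -o prints only the matched text, one match per output line.
printf '%s\n' "$lines" | grep -E -o 'id=[0-9]+'

# sort -u keeps one copy of each distinct match; wc -l counts the remaining lines.
printf '%s\n' "$lines" | grep -E -o 'id=[0-9]+' | sort -u | wc -l
```

Sorting before deduplicating matters if uniq is used in place of sort -u, since uniq only collapses adjacent duplicate lines.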