Homework 6

Regular expressions

This assignment involves writing shell statements with regular expressions that precisely match (and only match) the requested lines. Your answers may also utilize input/output redirection operators such as >, <, and |. For this assignment, unless otherwise specified, “letters” should match both lowercase and uppercase letters.
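As a quick refresher on the redirection operators mentioned above, here is a minimal sketch on toy data. The scratch file and its contents are made up for illustration and are not part of the homework files.

```shell
# Hypothetical scratch file; the real assignment files live in the hw6 folder.
tmp=$(mktemp)

# > writes printf's standard output into the file.
printf 'apple\nBanana\ncherry7\n' > "$tmp"

# < feeds the file to grep on standard input.
with_an=$(grep 'an' < "$tmp")

# | sends grep's output to wc -l, which counts the matching lines.
count=$(grep '^c' "$tmp" | wc -l)

echo "$with_an"
echo "$count"
rm -f "$tmp"
```

Here `grep 'an'` prints every line containing the text `an`, and the pipeline counts the lines that start with `c`.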

Download the homework files on attu

After logging into attu, download the homework files.

git archive --remote=git@gitlab.cs.washington.edu:cse391/25wi/hw6.git --prefix=hw6/ HEAD | tar -x

Task 1: Grep with regexes

Let’s practice writing grep commands involving regular expressions. Write your answers on the indicated lines in the task1.sh file in the hw6 folder.

What is the grep command that matches all lines from names.txt that contain at least one numeric character?
What is the grep command that matches all lines from names.txt that are exactly 4 characters long and consist only of uppercase or lowercase letters?
What is the grep command that matches all lines from names.txt that look like a first and last name: two words separated by a single space, where each word begins with an uppercase letter followed by one or more lowercase letters?

This last problem is intentionally flawed: writing a regular expression to capture all possible human names is a difficult if not impossible task. However, since many real-world systems use regular expressions like these to “validate” names (and all sorts of other personal information), it’s worth thinking about which assumptions we want to make! To quote Patrick McKenzie’s original post on this topic:

I have never seen a computer system which handles names properly and doubt one exists, anywhere.
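Several of the Task 1 questions care about the whole line rather than a substring. The sketch below, on made-up input unrelated to names.txt, shows how the anchors ^ and $ change what grep matches; `grep -c` is used only to count matching lines.

```shell
# Toy input, not from the assignment files.
lines=$(printf '%s\n' 'cat' 'concatenate' 'cat food')

# Without anchors, any line that contains "cat" somewhere matches.
contains=$(printf '%s\n' "$lines" | grep -c 'cat')

# With ^ and $, the pattern must account for the entire line.
whole=$(printf '%s\n' "$lines" | grep -c -E '^cat$')

echo "$contains"
echo "$whole"
```

All three lines contain `cat`, but only the first line is exactly `cat`.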

Task 2: Validating input with regexes

After a few weeks at FAANG, management has discovered that we need to start actually selling products to stay in business! You’ve been tasked with spinning up our customer account creation and billing team. Write your answers on the indicated lines in the task2.sh file in the hw6 folder.

What is the grep command that matches all the valid usernames in usernames.txt, where a username consists of at least 3 characters, each of which is a letter, digit, ., -, or _? To match the literal character - in a character set, place it last as in [abcde-]. Escaping it with \- does not work!
What is the grep command that matches all the valid emails in emails.txt? Validating real email addresses is quite complicated, so for this problem a valid email address:
1. starts with between 1 and 16 letters or digits,
2. followed by the @ symbol,
3. followed by a domain like uw that consists of at least one lowercase letter,
4. followed by a period ., and
5. ending in a top-level domain like edu, that consists of 2 or more lowercase letters.
What is the grep command that matches all the strong passwords in passwords.txt, where a strong password contains:
- at least 12 characters,
- at least one uppercase character,
- at least one lowercase character,
- at least one digit, and
- any other characters beyond these requirements.
What is the grep command that matches all rewards card numbers in cards.txt? Whereas credit card validation will check the Luhn sum, our rewards card numbers match one of two patterns:
- any 16-digit number beginning with a 5 and grouped into sets of 4 digits separated by a space, or
- any 13-to-16-digit number beginning with a 4 and grouped into sets of 4 digits (where the last group may have fewer than 4 digits) separated by a space.
What is the grep command that matches all valid URLs in urls.txt? Validating real URLs is quite complicated, so for this problem a valid URL:
1. optionally starts with either http:// or https://,
2. followed by at least one domain-and-dot pair (one or more lowercase letters followed by a .), as in cs.uw. or google., and
3. followed by a top-level domain like edu that consists of 2 or more lowercase letters.
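The hint about placing - last in a character set can be tried out on toy data. The following is a small sketch with made-up input, not a solution to any question above; it assumes a POSIX-style grep, where a backslash inside [...] is an ordinary character rather than an escape.

```shell
# Toy input, not from the assignment files.
lines=$(printf '%s\n' '555-1234' '555 1234' '+1-555')

# [0-9+-] means: a digit, a plus sign, or a literal hyphen (the hyphen is last).
# The anchors ^ and $ force the whole line to consist of those characters.
hyphen_ok=$(printf '%s\n' "$lines" | grep -c -E '^[0-9+-]+$')

# Inside [...], a backslash is an ordinary set member rather than an escape,
# so [+\-] is the set {+, \, -} and matches a line containing a backslash.
backslash_member=$(printf '%s\n' 'a\b' | grep -c '[+\-]')

echo "$hyphen_ok"         # count of lines made only of digits, + and -
echo "$backslash_member"  # count of lines containing +, \ or -
```

The line with a space is rejected by the first pattern. The second pattern shows why \- is not a reliable escape: the backslash quietly becomes part of the set.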

Task 3: Parsing CSE Web Logs with grep

Let’s use grep to parse an anonymized snapshot of the CSE course webserver logs. This is intended to model how we can use tools like grep and regular expressions to filter large amounts of data. Each line in the file common_log represents one request to a webpage on the CSE server in the following format.

[TIMESTAMP] [REQUEST] [STATUS] [SIZE] [REFERRER] [USER AGENT] [SERVER] - [N]

Generally, each [] item is separated by a space, and values that contain a space will be quoted. The [SIZE] and [REFERRER] can be - or "-" when the field is missing.

You won’t have to worry about most of these fields: for this task, we will focus on the [STATUS] and [USER AGENT]. Consider the following line, which has been reformatted and nicely-indented for clarity:

[04/Feb/2025:01:31:55 -0800]
  "GET /courses/cse391/24su/css/base.css HTTP/1.1" 200 159760
  "https://courses.cs.washington.edu/courses/cse391/25wi/"
  "Mozilla/5.0 ... Safari/605.1.15"
  courses.cs.washington.edu:443 - 6

[TIMESTAMP]: [04/Feb/2025:01:31:55 -0800]
[STATUS]: 200, an integer code used to signal (in this case) that the request was successfully served to the website visitor.
[USER AGENT]: "Mozilla/5.0 ... Safari/605.1.15", indicating the website visitor’s browser platform.
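To get a feel for the format, the example entry can be placed on a single line and probed with grep. This sketch pulls out only the [TIMESTAMP] field, which none of the questions below ask about; the variable names are illustrative.

```shell
# The example entry from above, joined onto one line as it would appear in the log.
entry='[04/Feb/2025:01:31:55 -0800] "GET /courses/cse391/24su/css/base.css HTTP/1.1" 200 159760 "https://courses.cs.washington.edu/courses/cse391/25wi/" "Mozilla/5.0 ... Safari/605.1.15" courses.cs.washington.edu:443 - 6'

# -o prints only the matched text instead of the whole line.
# ^\[    a literal [ at the start of the line
# [^]]*  any run of characters other than ]
# ]      the closing bracket (an ordinary character outside a bracket expression)
timestamp=$(printf '%s\n' "$entry" | grep -o -E '^\[[^]]*]')

echo "$timestamp"
```

The same idea of describing what may and may not appear between two delimiters carries over to the quoted fields.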

Write your answers on the indicated lines in the task3.sh file in the hw6 folder.

A status code of 200 means that the request was successful. What is the shell statement that only outputs entries in common_log containing the number 200?
Searching for 200 alone overestimates the number of successful requests, since 200 can also appear in other columns, for example as part of a year in a file path (like 2007). What is the shell statement that only outputs entries in common_log that contain the status code 200?
Web crawlers (“bots”) identify themselves using a very particular user agent. What is the shell statement that outputs all entries with a user agent that contains the characters +http, any other characters, the characters bot, any other characters aside from ), and then a closing )? For example, it should match +https://openai.com/gptbot) in the following user agent:
```
"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)"
```
How many unique bots are visiting the server (regardless of which page they requested)? Assuming bots are uniquely identified by the user agent rule that we described above (the text between the + and )), what is the shell statement that outputs the number of unique bots that have made requests to the CSE servers? The grep -o flag may help here, since it outputs only the matching text.
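To see how grep -o combines with other tools, here is a minimal sketch on made-up data that counts distinct hashtags. The input and the pattern are illustrative only and are unrelated to common_log.

```shell
# Toy input, not from the assignment files.
posts=$(printf '%s\n' \
  'post one #linux and #grep' \
  'post two #grep' \
  'post three no tags' \
  'post four #linux #regex')

# grep -o prints each match on its own line, sort -u removes duplicates,
# and wc -l counts the lines that remain.
unique_tags=$(printf '%s\n' "$posts" | grep -o -E '#[a-z]+' | sort -u | wc -l)

echo "$unique_tags"
```

Because -o emits one match per line, a single input line with two hashtags contributes two lines to the pipeline, which is what makes the later deduplication step meaningful.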