24su ver.

Note: this is for the Summer 2024 iteration of CSE 391. Looking for a different quarter? Please visit https://courses.cs.washington.edu/courses/cse391/.

HW6 - Regular Expressions (2 points)

Due Tuesday 07/30 at 10:00 am. No late submissions accepted.

Submission: Gradescope

Specification: Spec


This assignment focuses on using regular expressions with grep. A set of files required for this assignment - including what you’ll submit - are available in the file hw6.zip:

wget https://courses.cs.washington.edu/courses/cse391/24su/homework/hw6/hw6.zip
unzip hw6.zip

You will submit three files to Gradescope: task1.sh, task2.sh, and task3.sh. For each task, determine a single shell statement that will perform the requested operation(s), not the output of the command. While each statement must be one line, you may use input/output redirection operators such as >, <, and |. Please be sure to write the entire statement in your solution, including any relevant input files.

To calculate your final score on this assignment, sum the individual scores from Gradescope.

  • if your score is between [0, 1), you will get 0 points
  • if your score is between [1.0, 1.5), you will get 1 point
  • if your score is between [1.5, 2], you will get 2 points

Task 1: grep -E warmup

First, we’ll do some warmup exercises using grep and regular expressions.

You may find it helpful to refer to the Regex Syntax on our reference page.

Problem 1

Write the command that matches all lines from names.txt that contain at least one numeric character.

Problem 2

Write the command that matches all lines from names.txt that are exactly 4 characters long and consist only of uppercase or lowercase characters.

Problem 3

Write the command that matches lines from names.txt that “look like” a first and last name. For the purposes of this problem, both a first and last name should start with an uppercase letter, followed by one or more lowercase letters; first and last names should be separated by a single space. Your command should not match lines that do not exactly follow this format.

Note!

This problem is intentionally flawed: writing a regular expression to capture all possible human names is a difficult task, if not impossible. However, many pieces of software do use regular expressions like these to “validate” names. It’s worth thinking about, especially as you’ll encounter this in the future (either as a user or developer of software).

To quote Patrick McKenzie’s original post on this topic: “I have never seen a computer system which handles names properly and doubt one exists, anywhere.”

Task 2: Validating input with regexes

After a few weeks at FAANG, management has discovered that we need to start actually selling products to stay in business! You’ve been tasked with spinning up our customer account creation and billing team. It turns out that regular expressions are extremely helpful in validating user input - so we’ll give them a spin!

Problem 4: Username Validation

FAANG users sign up for accounts with usernames. After talking with the rest of your team, you’ve decided that usernames should:

  • be at least three characters
  • only contain (lowercase or uppercase) letters, digits, ., -, and _

Write a grep command that matches all (and only all) lines in usernames.txt that are valid usernames.

Tip

To match the literal character - in a character set, place it last, e.g. [abcde-]. Escaping it with \- does not work!

Problem 5: Email Validation

In addition to a username, customers must also provide a valid email address. You’ve heard that email address validation is actually very complicated, but the spec seems too long so you’ve decided to use a simplified version of a valid email address.

For the purposes of FAANG, a valid email address contains each of the following elements, in order:

  1. between 1 to 16 characters that are a lowercase letter, uppercase letter, or digit
  2. then, the @ symbol
  3. then, a domain (e.g. gmail) that consists of at least one lowercase letter
  4. then, a period (i.e. .)
  5. finally, a “top-level domain” or TLD (e.g. com), that consists of 2 or more lowercase letters

Write a grep command that matches all (and only all) lines in emails.txt that are valid emails.

Problem 6: Password Validation

We take security very seriously at FAANG. So, customers need to choose a strong password when creating an account. FAANG’s security experts have decided that a strong password must:

  • be at least 12 characters long
  • contain at least one uppercase character
  • contain at least one lowercase character
  • contain at least one digit

However, passwords may contain other characters as well.

Write a grep command that matches all (and only all) lines in passwords.txt that are strong passwords.

Tip

This is a very simplistic view of what a strong password actually is. Take a security class to learn more!

Problem 7: Credit Card Validation

To process payments, FAANG needs to validate credit cards entered on the site. As a startup, we only accept Visa and Mastercard (Amex’s fees are too high!!). Visa and Mastercard have different credit card formats:

  • Mastercard
    • begins with a 5
    • exactly 16 digits long, grouped into sets of 4 digits separated by a space
  • Visa
    • begins with a 4
    • between 13 and 16 digits long (inclusive), grouped into sets of 4 digits (last may be shorter) separated by a space

Write a grep command that matches all (and only all) lines in cards.txt that are valid Mastercard or Visa credit card numbers.

Fun Fact

Real-life credit card validation also involves a formula that helps prevent accidental errors; you do not need to do that for this assignment.

Problem 8: URL Validation

FAANG also lets users add personal websites to their profile page (why? who knows?). But, they want to make sure that these links work.

The spec for URLs also seems to complicated, so you’ve once again opted for a simple version. For FAANG users, a valid URL:

  1. optionally starts with either http:// or https://
  2. then, has at least one domain-. pair (e.g. cs.uw. or google.), where a domain-. pair has:
    1. a domain, which is one or more lowercase letters
    2. then, a period (i.e. .)
  3. then, a “top-level domain” or TLD (e.g. com), that consists of 2 or more lowercase letters
  4. optionally, a trailing /

Write a grep command that matches all (and only all) lines in urls.txt that are valid URLs.

Tip

You can put all sorts of complicated things in capture groups…

Task 3: Parsing CSE Web Logs with grep

In this task, you’ll use grep to parse an IP-anonymized snapshot of the CSE course web logs (from 4:00-5:59 PM on July 22nd). This is intended to model how software engineers use tools like grep and regular expressions to filter through large amounts of data.

Each line in the file common_log represents one request to a webpage on the CSE server. Roughly, its format looks like this:

[TIMESTAMP] [REQUEST] [STATUS] [SIZE] [REFERRER] [USER AGENT] [SERVER] - [N]

Generally, each [] item is separated by a space (and values with a space in them are quoted). The [SIZE] and [REFERRER] can be - or "-" when the field is missing.

You won’t have to worry about most of these fields; for this task, we will focus on the [STATUS] and [USER AGENT].

As an example example, consider the following line (indented for clarity):

[22/Jul/2024:14:12:33 -0700]
  "GET /courses/cse391/24su/css/base.css HTTP/1.1" 200 159760
  "https://courses.cs.washington.edu/courses/cse391/24su/"
  "Mozilla/5.0 ... Safari/605.1.15"
  courses.cs.washington.edu:443 - 6
  • the [TIMESTAMP] is [22/Jul/2024:14:12:33 -0700]
  • the [STATUS] is 200
  • the [USER AGENT] is "Mozilla/5.0 ... Safari/605.1.15"

Problem 9: Serving Statuses

A status code of 200 means that the request was successful. Write a single-line shell statement that only output entries in common_log that contain the number 200.

Hint: you should get 30420 entries.

Problem 10: Serving Statuses, Specifically

Oh no! You realized that the number 200 appears elsewhere, like in file paths that contain years (e.g. 2007). Write a new single-line shell statement that only outputs entries in common_log that contain the status code of 200.

Hints:

  • take advantage of the fact that you know the format of the log. How does this help you?
  • you should get 30373 entries.

Problem 11: Searching Salacious Scrapers

Web servers (such as CSE’s) are frequently hit by web scrapers; this is even more common with generative AI’s popularity. For example, the log contains many requests from crawlers made by Google, OpenAI, and Bing. CSE IT is interested in seeing how many times these scrapers (or “bots”) ping our webserver.

Unfortunately, greping for just bot doesn’t work: bot is a substring in other file paths.

After doing some research you’ve learned that crawlers are supposed to identify if they are a bot in the [USER AGENT] field. In particular, each bot has a URL between a +http and a ) in the user agent field with the text “bot” somewhere.

For example, see the +https://openai.com/gptbot) in the following user agent string:

"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)"

Write a single-line shell statement that outputs all entries with a [USER AGENT] matching this exact pattern (i.e. containing a +http, some characters, and then a closing )).

Hints:

  • the negation meaning of ^ is your friend :)
  • to help you double-check your work, you should be getting 14596 entries.

Problem 12: Summing Scrapers

IT wants to research each of the individual bots that are scraping the CSE website. In particular, they want to know how many different bots are hitting the server (regardless of which page they requested).

For this problem, assume that bots are unique to their URL (the text between the + and ) in the previous problem).

Write a single-line shell statement that outputs the number of unique bots have made requests to the CSE servers.

Three hints:

  • the -o flag of grep outputs only what text was matched.
  • remember what we saw in previous HWs on using sort, uniq, and wc together - both with respect to ordering, and building the solution up gradually!
  • the number should be much, much lower than 14596!

(as an aside: all of these links should lead to valid websites. Do you recognize these companies?)