HW10 - More Shell Scripting (2 points)

Due Tuesday 12/10 at 1:00 pm. No late submissions accepted.

Submission: Gradescope

Specification: Spec


This assignment is focused on giving you more practice with Bash shell scripting. In particular, you’ll process a large amount of data (the common_log from HW6) using regular expressions and fundamental programming concepts in Bash. You’ll also learn a bit more about the common_log dataset (and how frequently bots are pinging the CSE web servers)!

To calculate your final score on this assignment, sum the individual scores from Gradescope.

  • if your score is between [0, 1), you will get 0 points
  • if your score is between [1.0, 1.5), you will get 1 point
  • if your score is between [1.5, 2], you will get 2 points

Assignment Overview

Similar to HW6 Task 3, this assignment will look at IP-anonymized snapshots of the CSE course web logs. In particular, you will write and turn in to Gradescope an analyzer.sh file that takes in a mandatory argument (the path to a log file) and an optional argument to produce some summary statistics about the data in the file.

After finishing all of the steps in the spec (“Main Program” and “Arguments”), running the script with the common_log file from HW10:

$ ./analyzer.sh common_log

would produce the following output:

Processing 54991 entries...
Time Range: [02/Dec/2024:09:00:00 -0800] - [02/Dec/2024:10:59:12 -0800]

42518/54991 (77%) requests succeeded (200 OKs)

5589/54991 (10%) requests were made by crawlers
18 unique crawlers were detected

Top 3 crawlers:
   1688 http://www.bing.com/bingbot.htm
   1050 http://www.google.com/bot.html
    745 https://openai.com/gptbot

Top 3 most-crawled courses:
  14204 cse163
   4235 cse160
   1702 cse373

As another example, assume that you have the common_log file from HW6. Running the script with the optional number-to-show argument:

$ ./analyzer.sh common_log 5

would produce the following output:

Processing 12875 entries...
Time Range: [05/Nov/2024:07:00:00 -0800] - [05/Nov/2024:07:59:58 -0800]

8848/12875 (68%) requests succeeded (200 OKs)

3057/12875 (23%) requests were made by crawlers
15 unique crawlers were detected

Top 5 crawlers:
   1388 http://www.google.com/bot.html
    900 http://www.bing.com/bingbot.htm
    205 http://www.semrush.com/bot.html
    151 http://www.apple.com/go/applebot
    104 http://ahrefs.com/robot/

Top 5 most-crawled courses:
    978 cse154
    869 cse143
    625 cse546
    521 cse163
    433 cse373

Data Setup

When developing your solution, you’ll want to test it on some data. We’ll provide you two datasets, both taken from the real common_log file on the CSE web server.

The first is the common_log from 9:00-10:59 AM on December 2nd:

wget https://courses.cs.washington.edu/courses/cse391/24au/homework/hw10/common_log.zip
unzip common_log.zip

Another data source is available from HW6. If you haven’t already, clone the files from HW6 to get the common_log file from 7:00-7:59 AM on November 5th:

wget https://courses.cs.washington.edu/courses/cse391/24au/homework/hw6/hw6.zip
unzip hw6.zip

common_log format

Info

This is the same format as HW6; it’s copied here (and slightly expanded upon) for your convenience.

Each line in the file common_log represents one request to a webpage on the CSE server. Roughly, its format looks like this:

[TIMESTAMP] [REQUEST] [STATUS] [SIZE] [REFERRER] [USER AGENT] [SERVER] - [N]

Generally, each [] item is separated by a space (and values with a space in them are quoted). The [SIZE] and [REFERRER] can be - or "-" when the field is missing.

You won’t have to worry about most of these fields; for this task, we will focus on the [TIMESTAMP], [REQUEST], [STATUS], and [USER AGENT].

As an example, consider the following line (indented for clarity):

[22/Jul/2024:14:12:33 -0700]
  "GET /courses/cse391/24au/css/base.css HTTP/1.1" 200 159760
  "https://courses.cs.washington.edu/courses/cse391/24au/"
  "Mozilla/5.0 ... Safari/605.1.15"
  courses.cs.washington.edu:443 - 6
  • the [TIMESTAMP] is [22/Jul/2024:14:12:33 -0700]
  • the [REQUEST] is "GET /courses/cse391/24au/css/base.css HTTP/1.1"
  • the [STATUS] is 200
  • the [USER AGENT] is "Mozilla/5.0 ... Safari/605.1.15"

You may additionally assume the following things, which are true in both of the provided datasets:

  • a [TIMESTAMP] always starts with [ and ends with ]
  • a [REQUEST] or [USER AGENT] is always enclosed with double quotes (")
  • a [STATUS] is always three digits

Main Program

We’ll now describe the steps to produce the main output of the program, broken into distinct parts. We suggest that you implement each part before moving on to the next one. Each part should be outputted sequentially, separated by an extra newline. By the end of this section, your program output should look like this:

$ ./analyzer.sh common_log
Processing 54991 entries...
Time Range: [02/Dec/2024:09:00:00 -0800] - [02/Dec/2024:10:59:12 -0800]

42518/54991 (77%) requests succeeded (200 OKs)

5589/54991 (10%) requests were made by crawlers
18 unique crawlers were detected

Top 3 crawlers:
   1688 http://www.bing.com/bingbot.htm
   1050 http://www.google.com/bot.html
    745 https://openai.com/gptbot

Top 3 most-crawled courses:
  14204 cse163
   4235 cse160
   1702 cse373

For now, we will assume that analyzer.sh is called with exactly one argument: the path to a common_log file. You may safely assume that if an argument is provided, it is a valid common_log file; you do not need to check this.

Header (Warmup)

First, we just want to get some basic information about the common_log we’re analyzing.

In particular, have your script output the following details in two separate lines:

  1. the number of lines in the file
  2. the [TIMESTAMP] of the first and last lines in the file

For example, the header for the common_log provided in HW6 should be:

Processing 12875 entries...
Time Range: [05/Nov/2024:07:00:00 -0800] - [05/Nov/2024:07:59:58 -0800]

and the header for the common_log provided in HW10 should be:

Processing 54991 entries...
Time Range: [02/Dec/2024:09:00:00 -0800] - [02/Dec/2024:10:59:12 -0800]

Your analyzer.sh should output exactly the above lines for the two provided common_log files, including capitalization, spelling, and whitespace.
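One possible shape for this part (a sketch, not the only approach) combines wc -l with head and tail. It’s demonstrated below on a tiny two-line sample file rather than the real common_log:

```shell
#!/bin/bash
# A tiny two-line sample log, standing in for the real common_log.
cat > sample_log <<'EOF'
[05/Nov/2024:07:00:00 -0800] "GET /a HTTP/1.1" 200 10 "-" "x" host:443 - 1
[05/Nov/2024:07:59:58 -0800] "GET /b HTTP/1.1" 404 10 "-" "x" host:443 - 1
EOF

# 1. Number of entries: one log entry per line.
count=$(wc -l < sample_log)

# 2. The bracketed timestamp at the start of the first and last lines.
first=$(head -n 1 sample_log | grep -o '^\[[^]]*\]')
last=$(tail -n 1 sample_log | grep -o '^\[[^]]*\]')

echo "Processing $count entries..."
echo "Time Range: $first - $last"
```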

Info

📬 This question is graded by running your analyzer.sh on the two common_log files and looking at the exact output of your first two lines.

200 OKs

Next, you are interested in the number of requests that are successful, which you’ll estimate by looking for the number of lines that have a [STATUS] of exactly 200.

Output the total number of requests with a 200 status code, followed by the total number of requests, and then the percentage of requests with a 200 status code. For the purpose of this problem, you should calculate the percentage with the exact formula of 200_requests * 100 / total_requests.

Info

Why this formula? By default, bash doesn’t support floating-point division. This calculates the percentage using integer division instead.
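Concretely, with the HW6 numbers:

```shell
ok_count=8848
total=12875

# Multiply first: $(( )) truncates toward zero, so dividing last
# keeps the integer percentage.
percent=$(( ok_count * 100 / total ))
echo "$percent"   # 68

# Dividing first truncates 8848/12875 to 0 before the * 100:
echo $(( ok_count / total * 100 ))   # 0
```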

For example, the output for the common_log provided in HW6 should be:

8848/12875 (68%) requests succeeded (200 OKs)

and the output for the common_log provided in HW10 should be:

42518/54991 (77%) requests succeeded (200 OKs)

Your analyzer.sh should output exactly the above lines for the two provided common_log files, including capitalization, spelling, and whitespace.

Info

📬 This question is graded by running your analyzer.sh on the two common_log files and looking at the exact output of the fourth line.

Bot Analysis, Part 1

Switching gears, we’ll once again look at crawlers. We’ll generate the same summary statistics that we calculated in HW6, but script this (rather than having to run multiple separate commands).

Add a feature to your script to output two lines:

  1. the number and percentage of requests made by a crawler
  2. the number of unique crawlers detected

Some implementation notes:

  • as a reminder, we said that a request was made by a crawler if its [USER AGENT] contains a string that starts with +http and ends with the first possible ), where the text bot appears between the +http and the )
  • for the purposes of this problem, a crawler is unique if the content between the +http and the first ) is unique — even if the resulting text isn’t strictly a URL
  • to calculate the percentage score, use the same formula as shown above (i.e. by multiplying by 100 first and then using integer division)
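As a hedged illustration of the crawler definition (not necessarily the exact regex you’ll end up with), the pattern below uses [^)]* so the match cannot run past the first ):

```shell
# A user agent containing a crawler marker.
agent='"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"'

# +http ... bot ... ), where [^)]* forbids an earlier ) inside the match.
crawler_re='\+http[^)]*bot[^)]*\)'

match=$(echo "$agent" | grep -oE "$crawler_re")
echo "$match"   # +http://www.bing.com/bingbot.htm)
```

From here, counting the matching lines gives you the first number, and de-duplicating the matched text (e.g. with sort -u) gives you the count of unique crawlers.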

For example, the output for the common_log provided in HW6 should be:

3057/12875 (23%) requests were made by crawlers
15 unique crawlers were detected

and the output for the common_log provided in HW10 should be:

5589/54991 (10%) requests were made by crawlers
18 unique crawlers were detected

Your analyzer.sh should output exactly the above lines for the two provided common_log files, including capitalization, spelling, and whitespace.

Info

📬 This question is graded by running your analyzer.sh on the two common_log files and looking at the exact output of lines 6 and 7.

Bot Analysis, Part 2

Now, you want to do some more research on the bots that are pinging our server (and maybe send their owners a bill for all the bandwidth they’re using)! To do that, you’ll need to know which bots are inducing the most load.

Add a feature to your script to output the “URLs” for the top 3 most common bots, sorted in descending order. For the purposes of this problem, the “URL” is the text between (but not including) the + and ).

For example, the output for the common_log provided in HW6 should be:

Top 3 crawlers:
   1388 http://www.google.com/bot.html
    900 http://www.bing.com/bingbot.htm
    205 http://www.semrush.com/bot.html

and the output for the common_log provided in HW10 should be:

Top 3 crawlers:
   1688 http://www.bing.com/bingbot.htm
   1050 http://www.google.com/bot.html
    745 https://openai.com/gptbot

Pay special attention to the whitespace (which is formatted by the output of a specific command and flag from this class).

Your analyzer.sh should output exactly the above lines for the two provided common_log files, including capitalization, spelling, and whitespace.
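That whitespace falls out naturally from a sort | uniq -c | sort -rn | head pipeline. A small demonstration with made-up placeholder names (a.com and friends are not real crawler URLs):

```shell
# Count duplicates, sort counts descending, keep the top 3.
# uniq -c prefixes each line with a right-aligned count.
top3=$(printf '%s\n' b.com a.com a.com c.com a.com b.com \
  | sort | uniq -c | sort -rn | head -n 3)
echo "$top3"
```

The input must be sorted before uniq -c, since uniq only collapses adjacent duplicate lines.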

Info

📬 This question is graded by running your analyzer.sh on the two common_log files and looking at the exact output of lines 9-12.

Course Analysis

Finally, we’re curious which classes are most frequently pinged by crawlers. We should congratulate the instructors for teaching a popular class!

Add a feature to your script that outputs the top 3 most frequently appearing course codes. To determine the course code for a request, you should follow the exact steps:

  1. find all entries that are GET requests.
    • For this problem, this is an entry where the [REQUEST] field starts with GET.
  2. next, examine the path.
    • For this problem, this is the string that immediately follows the GET (and a space), starting with a / and ending at (but not including) the first following space.
  3. determine whether the path is tied to a course. A path is tied to a course if (and only if):
    • the path starts with /courses/
    • then, it immediately has a course code, which starts with cse, then has three numbers, and then optionally one letter

Note that not all requests have course codes, and some contain course codes but not in the format/location described above. You should only include course codes produced in the exact format above.
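One hedged way to express those steps for a single request (your full pipeline would run over every line of the file):

```shell
request='"GET /courses/cse391/24au/css/base.css HTTP/1.1"'

# Steps 1-2: GET requests only; the path runs from the / after
# "GET " up to (but not including) the next space.
path=$(echo "$request" | grep -oE '^"GET [^ ]+' | sed 's/^"GET //')
echo "$path"   # /courses/cse391/24au/css/base.css

# Step 3: /courses/ followed by cse, three digits, optionally one letter.
code=$(echo "$path" | grep -oE '^/courses/cse[0-9]{3}[a-z]?' | sed 's|^/courses/||')
echo "$code"   # cse391
```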

For example, the output for the common_log provided in HW6 should be:

Top 3 most-crawled courses:
    978 cse154
    869 cse143
    625 cse546

and the output for the common_log provided in HW10 should be:

Top 3 most-crawled courses:
  14204 cse163
   4235 cse160
   1702 cse373

Pay special attention to the whitespace (which is formatted by the output of a specific command and flag from this class).

Your analyzer.sh should output exactly the above lines for the two provided common_log files, including capitalization, spelling, and whitespace.

Info

📬 This question is graded by running your analyzer.sh on the two common_log files and looking at the exact output of lines 14-17.

Arguments

By default, we’ve been assuming that one argument is passed to the script. To wrap things up, we’ll handle two other cases: when no arguments are provided, and when the user provides an extra argument specifying how many crawlers and courses to output.

No Arguments

If the user accidentally forgets to pass a common_log file, we shouldn’t do the above “main program”; instead, we should tell the user how to properly use our command and exit with a code of 1.

In particular, your program should have the exact output below when running ./analyzer.sh:

Usage: ./analyzer.sh LOGFILE [NUM_TO_SHOW]
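A minimal sketch of the guard (run here through bash -c so the exit 1 can be observed without ending the current shell):

```shell
# The check that would sit at the top of analyzer.sh: $# is the
# number of arguments the script was called with.
check='
if [ $# -eq 0 ]; then
    echo "Usage: ./analyzer.sh LOGFILE [NUM_TO_SHOW]"
    exit 1
fi
'

# Simulate running with no arguments.
output=$(bash -c "$check")
status=$?
echo "$output"   # Usage: ./analyzer.sh LOGFILE [NUM_TO_SHOW]
echo "$status"   # 1
```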

Info

📬 This question is graded by running your analyzer.sh with no arguments and checking both the output and the exit code.

Optional Argument: NUM_TO_SHOW

If the user passes an additional argument to the script, we’ll interpret that as the number of crawlers and courses to show. This should change the number of crawlers and courses you output in “Bot Analysis, Part 2” and “Course Analysis”.

For example, the output for the common_log provided in HW6 with ./analyzer.sh common_log 5 should be:

Processing 12875 entries...
Time Range: [05/Nov/2024:07:00:00 -0800] - [05/Nov/2024:07:59:58 -0800]

8848/12875 (68%) requests succeeded (200 OKs)

3057/12875 (23%) requests were made by crawlers
15 unique crawlers were detected

Top 5 crawlers:
   1388 http://www.google.com/bot.html
    900 http://www.bing.com/bingbot.htm
    205 http://www.semrush.com/bot.html
    151 http://www.apple.com/go/applebot
    104 http://ahrefs.com/robot/

Top 5 most-crawled courses:
    978 cse154
    869 cse143
    625 cse546
    521 cse163
    433 cse373

Note that the lines say Top 5 crawlers and Top 5 most-crawled courses instead of Top 3 crawlers and Top 3 most-crawled courses. The rest of the output should not change.

You may assume that the user passes in a non-negative number. If there are fewer unique crawlers (or courses) than NUM_TO_SHOW, just output all of the crawlers (or courses).
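Bash’s ${2:-3} default expansion handles both cases in one line; set -- is used below only to simulate the script’s positional arguments:

```shell
set -- common_log            # as if run as: ./analyzer.sh common_log
num_to_show="${2:-3}"        # no second argument -> default of 3
echo "$num_to_show"          # 3

set -- common_log 5          # as if run as: ./analyzer.sh common_log 5
num_to_show="${2:-3}"
echo "$num_to_show"          # 5
```

Feeding "$num_to_show" to head -n also covers the “fewer than NUM_TO_SHOW” case for free, since head prints everything when its input is shorter than N.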

Info

📬 This question is graded by running your analyzer.sh with two arguments (on both common_logs), testing both 5 and 25 as the second argument, and comparing the entire output.

Tips, Tricks, and Things to Look Out For

Some general advice:

  • above all, please look at the spec definitions and output format carefully!
  • stuck? reach out and ask for help!!
  • while you do each part, we suggest frequently debugging with echo statements in the same way that you’d print-debug a Java program. Once you’re confident things work, then format them into what the spec is looking for.
  • prioritize correctness over efficiency; running the same command multiple times is completely okay!
  • this homework heavily leans on ideas from previous homeworks. If these feel shaky, we definitely suggest reviewing some of the commands we practiced in HW2, 3, 6, and 7.

And, some specific advice:

  • you may find yourself writing the same regex multiple times. Since regular expressions are just strings, you can store these as a variable and reuse them!
  • when doing arithmetic in Bash, remember that:
    • you need to use special syntax (either let or $(()))
    • the order of operations matters, since / is integer division!
  • whenever you use a variable within a string and/or as an argument, we suggest wrapping it in double quotes (e.g. "$argument"). This is particularly helpful if the content of your variable has a space in it.
  • when using sed to edit text, recall that sed keeps (does not edit) any text that your pattern does not match. If you want to replace an entire line, your pattern needs to match the entire line.
  • to print an empty line, just write echo (with no arguments).
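To see why the quoting tip matters, compare what a variable containing a space expands to with and without quotes (a toy example):

```shell
arg='two words'

# Quoted: passed to printf as one argument.
quoted=$(printf '<%s>' "$arg")
echo "$quoted"     # <two words>

# Unquoted: word-split into two separate arguments.
unquoted=$(printf '<%s>' $arg)
echo "$unquoted"   # <two><words>
```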

Appendix: Using this on Real Data

Warning

This portion is not required for the homework; it’s just for fun!

If you’ve correctly implemented the homework, you can use this on live data from the CSE web logs!

As a reminder, the logs are hosted at /cse/web/courses/logs/common_log. You will need to do a bit of pre-processing:

  • you probably won’t be able to run analyzer.sh on the entire common_log file, since it’s quite large
  • you’ll want to remove the IPs (which we manually removed when providing you these submission files)

We suggest doing the following:

  1. first, use tail to get a subset of the log that you’re interested in (100000 seems like a reasonably-sized sample):
$ tail -n 100000 /cse/web/courses/logs/common_log > common_log
  2. next, remove the portions before the timestamp. You can do this with the following sed command:
$ sed -ri.bak 's/^[^[]*(\[.*)$/\1/' common_log
  3. finally, run your analyzer.sh:
$ ./analyzer.sh common_log
Processing 100000 entries...
Time Range: [07/Aug/2024:21:07:15 -0700] - [08/Aug/2024:01:57:54 -0700]

70802/100000 (70%) requests succeeded (200 OKs)

12531/100000 (12%) requests were made by crawlers
19 unique crawlers were detected

Top 3 crawlers:
   3107 https://openai.com/gptbot
   2952 http://www.google.com/bot.html
   1758 http://www.semrush.com/bot.html

Top 3 most-crawled courses:
  28603 cse163
   3332 cse341
   3332 cse154

Appendix: Methodology

Warning

This portion is not required for the homework; it’s just for fun!

As an aside, there are a few limitations with this analysis:

  • the way we look for successful requests is a non-trivial undercount: there are more successful status codes than just 200. If you’re curious, see List of HTTP status codes.
  • the way that we look for crawlers is a non-trivial undercount: many bots do not follow the format we described above.
  • For example,
    • some bots do not have URLs that start with http
    • some bots do not include the +
    • some bots do contain a +http and then eventually a ), but preceding the ) is more non-URL characters
    • some bots do not contain the word “bot”
    • some bots completely lie about who they are in the user agent (not much we can do here!)
  • Catching bots — both crawlers and other types of bots — is a real-world problem that many software engineers (and huge companies) work on. Generally speaking, it’s an arms race/cat-and-mouse game; people who write bots are frequently trying to evade detection!
  • the way that we count course codes is slightly incorrect, since:
    • not all requests are GET requests
    • some (very old) course website paths don’t exactly follow the /courses/COURSE_CODE format
    • not all CSE course codes follow the format we described above (e.g. CSE 390HA)

We chose to ignore these issues to make this a reasonably-scoped project. But, if you’re interested, you can dive more into the above problems and give them a shot!