Homework 10

More Shell Scripting

This assignment gives you more practice with Bash shell scripting: you’ll write an analyzer.sh that prints summary statistics about the data in a common_log dataset of anonymized CSE course webserver logs. Along the way, you’ll process a large amount of data from the common_log using regular expressions and Bash programming concepts, and learn more about the dataset itself. Here is an example output for ./analyzer.sh common_log:

Processing 54991 entries...
Time Range: [02/Dec/2024:09:00:00 -0800] - [02/Dec/2024:10:59:12 -0800]

42518/54991 (77%) requests succeeded (200 OKs)

5589/54991 (10%) requests were made by crawlers
18 unique crawlers were detected

Top 3 crawlers:
   1688 http://www.bing.com/bingbot.htm
   1050 http://www.google.com/bot.html
    745 https://openai.com/gptbot

Top 3 most-crawled courses:
  14204 cse163
   4235 cse160
   1702 cse373

Running the same script on a different log capture with the optional number-to-show argument, ./analyzer.sh common_log 5, would print:

Processing 12875 entries...
Time Range: [05/Nov/2024:07:00:00 -0800] - [05/Nov/2024:07:59:58 -0800]

8848/12875 (68%) requests succeeded (200 OKs)

3057/12875 (23%) requests were made by crawlers
15 unique crawlers were detected

Top 5 crawlers:
   1388 http://www.google.com/bot.html
    900 http://www.bing.com/bingbot.htm
    205 http://www.semrush.com/bot.html
    151 http://www.apple.com/go/applebot
    104 http://ahrefs.com/robot/

Top 5 most-crawled courses:
    978 cse154
    869 cse143
    625 cse546
    521 cse163
    433 cse373

Download the homework files on attu

After logging into attu, download the homework files.

git archive --remote=git@gitlab.cs.washington.edu:cse391/25wi/hw10.git --prefix=hw10/ HEAD | tar -x

You may also find it useful to refer to the common_log file you worked with in HW6.

Log format

As we learned earlier, each line in the file common_log represents one request to a webpage on the CSE server in the following format:

[TIMESTAMP] [REQUEST] [STATUS] [SIZE] [REFERRER] [USER AGENT] [SERVER] - [N]

Generally, the [] items are separated by single spaces, and values that themselves contain a space are quoted. The [SIZE] and [REFERRER] can be - or "-" when the field is missing.

You won’t have to worry about most of these fields: for this task, we will focus on [STATUS] and [USER AGENT]. Consider the following line, which has been reformatted and nicely indented for clarity:

[04/Feb/2025:01:31:55 -0800]
  "GET /courses/cse391/24su/css/base.css HTTP/1.1" 200 159760
  "https://courses.cs.washington.edu/courses/cse391/25wi/"
  "Mozilla/5.0 ... Safari/605.1.15"
  courses.cs.washington.edu:443 - 6

[TIMESTAMP]: [04/Feb/2025:01:31:55 -0800]
[STATUS]: 200, an integer code used to signal (in this case) that the request was successfully served to the website visitor.
[USER AGENT]: "Mozilla/5.0 ... Safari/605.1.15", indicating the website visitor’s browser platform.

Step 1: Print basic information

Let’s get started with analyzing the common_log by writing code to print the first three lines of the output:

Processing 54991 entries...
Time Range: [02/Dec/2024:09:00:00 -0800] - [02/Dec/2024:10:59:12 -0800]

Your analyzer.sh should print lines in exactly the above format, including capitalization, spelling, and whitespace. To print the empty third line, just invoke echo with no arguments.
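
Here is a minimal sketch of one possible approach, assuming the log entries are in chronological order; logfile and total are illustrative names, not required ones:

#!/bin/bash

logfile="$1"

# wc -l < file prints only the line count, without the filename.
total=$(wc -l < "$logfile")
echo "Processing ${total} entries..."

# Assuming entries are chronological, the range runs from the first
# line's timestamp to the last line's; grep -o keeps only the leading
# bracketed timestamp of each line.
first=$(head -n 1 "$logfile" | grep -o '^\[[^]]*\]')
last=$(tail -n 1 "$logfile" | grep -o '^\[[^]]*\]')
echo "Time Range: ${first} - ${last}"
echo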

Step 2: Count 200 OKs

Next, let’s count the number of successful requests, which you’ll estimate by counting the lines that have status code 200.

42518/54991 (77%) requests succeeded (200 OKs)

For the purpose of this problem, calculate the percentage using the formula 200_requests * 100 / total_requests. Why this formula? Bash arithmetic doesn’t support floating-point division, so the script computes the percentage with integer division instead; multiplying by 100 before dividing keeps the result from truncating to zero.
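
As a rough sketch, reusing the illustrative logfile and total variables from the Step 1 sketch; the pattern below leans on the status code immediately following the quoted [REQUEST], which is a heuristic you may want to tighten:

# Count lines whose status field is 200; the status code comes right
# after the closing quote of the [REQUEST], so '" 200 ' is a simple
# (if loose) pattern for it.
oks=$(grep -c '" 200 ' "$logfile")

# $(( ... )) is integer-only; multiply by 100 before dividing.
percent=$(( oks * 100 / total ))
echo "${oks}/${total} (${percent}%) requests succeeded (200 OKs)"
echo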

Step 3: Count crawlers (bots)

Switching gears, we’ll once again look at crawlers (bots) like we did in HW6. Add a feature to your script to print two lines:

5589/54991 (10%) requests were made by crawlers
18 unique crawlers were detected

A request is made by a crawler if its [USER AGENT] contains a string that starts with +http and ends at the first following ), with the text bot appearing somewhere between the +http and the ). A crawler is considered unique if the content between the +http and the first ) is unique, even if the resulting text isn’t strictly a URL.

To calculate the percentage score, use the same formula that we defined for counting successful requests.
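
Here is one way to encode that rule as an extended regex, shown as a sketch; bot_regex is an illustrative name, and the exact pattern is an assumption you should verify against the spec:

# "+http", then "bot" with no ")" before or after it, then the first ")".
bot_regex='\+http[^)]*bot[^)]*\)'

crawlers=$(grep -Ec "$bot_regex" "$logfile")
percent=$(( crawlers * 100 / total ))
echo "${crawlers}/${total} (${percent}%) requests were made by crawlers"

# grep -o prints each match on its own line, so sort -u | wc -l
# counts the distinct +http...) strings.
unique=$(grep -Eo "$bot_regex" "$logfile" | sort -u | wc -l)
echo "${unique} unique crawlers were detected"
echo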

Step 4: Identify top crawlers

Let’s identify the top crawlers that are making requests—and maybe send their owners a bill for all the bandwidth they’re using! Add a feature to your script to print the URLs for the top 3 most common bots, sorted in descending order. For the purpose of this problem, the URL is the text between (but not including) the + and ).

Top 3 crawlers:
   1688 http://www.bing.com/bingbot.htm
   1050 http://www.google.com/bot.html
    745 https://openai.com/gptbot

Pay special attention to the whitespace, which is formatted by the output of a specific command and flag we learned in class.
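
One command that produces exactly this count-prefixed padding is uniq -c. As a sketch, reusing the illustrative bot_regex from the previous step:

echo "Top 3 crawlers:"

# Extract each +http...) match, strip the "+" and ")" wrappers to get
# the URL, then count and rank; uniq -c supplies the padded counts.
grep -Eo "$bot_regex" "$logfile" \
    | sed -E 's/^\+//; s/\)$//' \
    | sort | uniq -c | sort -rn | head -n 3
echo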

Step 5: Identify most-crawled courses

Finally, let’s see which courses are most frequently pinged by crawlers. Add a feature to your script that prints the top 3 most frequently appearing course codes:

Top 3 most-crawled courses:
  14204 cse163
   4235 cse160
   1702 cse373

To determine which requests count toward a course (see the sketch after this list):

  1. Find all entries that are GET requests: all entries where the [REQUEST] field starts with GET.
  2. Examine the path: the string that immediately follows the GET (and a space), starting with a / and ending at (but not including) the first following space.
  3. A path is tied to a course if (and only if) the path matches the format /courses/ followed by the course code: cse, three digits, and (optionally) one letter.

Not all requests have course codes, and some contain course codes but not in the format/location described above. Include only the course codes that follow the exact format specified above. As in the prior step, pay special attention to the whitespace.
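
A sketch of one plausible reading of the rules above, reusing the illustrative bot_regex; both the decision to count only crawler entries and the exact regex are assumptions to check against the spec:

echo "Top 3 most-crawled courses:"

# Keep crawler entries (per this step's intro), then match GET paths of
# the form /courses/<code>, where <code> is cse + 3 digits + an optional
# letter, followed by a "/" or the space that ends the path.
grep -E "$bot_regex" "$logfile" \
    | grep -Eo '"GET /courses/cse[0-9]{3}[a-z]?[/ ]' \
    | grep -Eo 'cse[0-9]{3}[a-z]?' \
    | sort | uniq -c | sort -rn | head -n 3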

Step 6: Handle optional argument

So far, we’ve been assuming that exactly one argument is passed to the script. Let’s handle two other cases: when no arguments are provided, and when the user provides an extra argument specifying how many crawlers and courses to output.

If the user forgets to pass a common_log file, report the problem and exit with code 1. The following output should appear when running ./analyzer.sh:

Usage: ./analyzer.sh LOGFILE [NUM_TO_SHOW]

If the user passes an additional argument to the script, we’ll interpret that as the number of crawlers and courses to show. Assume the NUM_TO_SHOW is a non-negative integer. If there are fewer unique crawlers (or courses) than NUM_TO_SHOW, just output all the crawlers (or courses). Remember to update the header line, Top X crawlers and Top X most-crawled courses.