Homework 10
More Shell Scripting
This assignment involves more practice with Bash shell scripting: you'll write an analyzer.sh script that prints summary statistics about the data in a common_log dataset of anonymized CSE course webserver logs. You'll process a large amount of data from the common_log using regular expressions and Bash programming concepts, and learn more about the dataset along the way. Here is an example output for ./analyzer.sh common_log:
Processing 54991 entries...
Time Range: [02/Dec/2024:09:00:00 -0800] - [02/Dec/2024:10:59:12 -0800]
42518/54991 (77%) requests succeeded (200 OKs)
5589/54991 (10%) requests were made by crawlers
18 unique crawlers were detected
Top 3 crawlers:
1688 http://www.bing.com/bingbot.htm
1050 http://www.google.com/bot.html
745 https://openai.com/gptbot
Top 3 most-crawled courses:
14204 cse163
4235 cse160
1702 cse373
Running the same script with the optional number-to-show argument, ./analyzer.sh common_log 5, would print:
Processing 12875 entries...
Time Range: [05/Nov/2024:07:00:00 -0800] - [05/Nov/2024:07:59:58 -0800]
8848/12875 (68%) requests succeeded (200 OKs)
3057/12875 (23%) requests were made by crawlers
15 unique crawlers were detected
Top 5 crawlers:
1388 http://www.google.com/bot.html
900 http://www.bing.com/bingbot.htm
205 http://www.semrush.com/bot.html
151 http://www.apple.com/go/applebot
104 http://ahrefs.com/robot/
Top 5 most-crawled courses:
978 cse154
869 cse143
625 cse546
521 cse163
433 cse373
Download the homework files on attu
After logging into attu, download the homework files.
git archive --remote=git@gitlab.cs.washington.edu:cse391/25wi/hw10.git --prefix=hw10/ HEAD | tar -x
You may also find it useful to refer to the earlier common_log file captured from HW6.
Log format
As we learned earlier, each line in the file common_log represents one request to a webpage on the CSE server, in the following format:
[TIMESTAMP] [REQUEST] [STATUS] [SIZE] [REFERRER] [USER AGENT] [SERVER] - [N]
Generally, each [] item is separated by a space, and values that contain a space will be quoted. The [SIZE] and [REFERRER] can be - or "-" when the field is missing.
You won't have to worry about most of these fields: for this task, we will focus on the [STATUS] and [USER AGENT]. Consider the following line, which has been reformatted and nicely indented for clarity:
[04/Feb/2025:01:31:55 -0800]
"GET /courses/cse391/24su/css/base.css HTTP/1.1" 200 159760
"https://courses.cs.washington.edu/courses/cse391/25wi/"
"Mozilla/5.0 ... Safari/605.1.15"
courses.cs.washington.edu:443 - 6
- [TIMESTAMP]: [04/Feb/2025:01:31:55 -0800]
- [STATUS]: 200, an integer code used to signal (in this case) that the request was successfully served to the website visitor.
- [USER AGENT]: "Mozilla/5.0 ... Safari/605.1.15", indicating the website visitor's browser platform.
Step 1: Print basic information
Let's get started with analyzing the common_log by writing code to print the first three lines.
Processing 54991 entries...
Time Range: [02/Dec/2024:09:00:00 -0800] - [02/Dec/2024:10:59:12 -0800]
Your analyzer.sh should print lines in exactly the above format, including capitalization, spelling, and whitespace. To print the empty third line, just invoke echo with no arguments.
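If you're not sure where to start, here is a minimal sketch. It assumes the log entries are already in chronological order (so the first and last lines bound the time range), and the variable names are placeholders, not requirements:

#!/bin/bash
logfile="$1"
# wc -l with input redirection prints just the count, with no filename.
total=$(wc -l < "$logfile")
echo "Processing ${total} entries..."
# The timestamp is the leading bracketed field; grep -o extracts it
# from the first and last entries.
first=$(head -n 1 "$logfile" | grep -o '^\[[^]]*\]')
last=$(tail -n 1 "$logfile" | grep -o '^\[[^]]*\]')
echo "Time Range: ${first} - ${last}"
echo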
Step 2: Count 200 OKs
Next, let's count the number of successful requests, which you'll estimate by looking for the number of lines that have status code 200.
42518/54991 (77%) requests succeeded (200 OKs)
For the purpose of this problem, calculate the percentage using the formula 200_requests * 100 / total_requests. Why this formula? By default, bash doesn't support floating-point division, so this calculates the percentage using integer division instead; multiplying by 100 before dividing keeps the truncation from rounding everything down to 0.
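A hedged sketch, assuming the status code always immediately follows the closing quote of the [REQUEST] field (as it does in the example line above):

# $logfile and $total come from the Step 1 sketch.
ok=$(grep -c '" 200 ' "$logfile")
echo "${ok}/${total} ($(( ok * 100 / total ))%) requests succeeded (200 OKs)"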
Step 3: Count crawlers (bots)
Switching gears, we’ll once again look at crawlers (bots) like we did in HW6. Add a feature to your script to print two lines:
5589/54991 (10%) requests were made by crawlers
18 unique crawlers were detected
A request is made by a crawler if its [USER AGENT] contains a string starting with +http and ending with (the first possible) ), where the text bot appears between the +http and ). A crawler is considered unique if the content between the +http and the first ) is unique, even if the resulting text isn't strictly a URL.
To calculate the percentage, use the same formula that we defined for counting successful requests.
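One possible approach, assuming the +http...) pattern only ever occurs inside the [USER AGENT] field. grep -c counts matching lines (i.e., requests), while grep -o extracts each matched span so duplicates can be collapsed:

# [+] and [)] are bracket expressions matching literal '+' and ')';
# [^)]* stops the match at the first closing parenthesis.
pat='[+]http[^)]*bot[^)]*[)]'
bots=$(grep -cE "$pat" "$logfile")
unique=$(grep -oE "$pat" "$logfile" | sort -u | wc -l)
echo "${bots}/${total} ($(( bots * 100 / total ))%) requests were made by crawlers"
echo "${unique} unique crawlers were detected"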
Step 4: Identify top crawlers
Let's identify the top crawlers that are making requests (and maybe send their owners a bill for all the bandwidth they're using!). Add a feature to your script to print the URLs for the top 3 most common bots, sorted in descending order. For the purpose of this problem, the URL is the text between (but not including) the + and ).
Top 3 crawlers:
1688 http://www.bing.com/bingbot.htm
1050 http://www.google.com/bot.html
745 https://openai.com/gptbot
Pay special attention to the whitespace, which is formatted by the output of a specific command and flag we learned in class.
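If that hint points where I think it does, the alignment comes from uniq -c. A sketch, reusing the crawler pattern from the previous step:

# Trim the leading '+' and trailing ')', then rank by frequency; the
# left-padded counts are uniq -c's own output format.
grep -oE '[+]http[^)]*bot[^)]*[)]' "$logfile" \
    | sed 's/^+//; s/)$//' \
    | sort | uniq -c | sort -nr | head -n 3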
Step 5: Identify most-crawled courses
Finally, let’s see which courses are most frequently pinged by crawlers. Add a feature to your script that prints the top 3 most frequently appearing course codes:
Top 3 most-crawled courses:
14204 cse163
4235 cse160
1702 cse373
- Find all entries that are GET requests: all entries where the [REQUEST] field starts with GET.
- Examine the path: the string that immediately follows the GET (and a space), starting with a / and ending at (but not including) the first following space.
- A path is tied to a course if (and only if) the path matches the format /courses/ followed by the course code: cse, three digits, and (optionally) one letter.
Not all requests have course codes, and some contain course codes but not in the format/location described above. Include only the course codes that follow the exact format specified above. As in the prior step, pay special attention to the whitespace.
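A hedged sketch of the three rules above. It assumes the path may continue past the course code (as in /courses/cse391/25wi/...) and uses \b, a GNU grep word-boundary extension (available on attu), to stop the match at the end of the code:

# First grep matches GET requests whose path starts with /courses/ and
# a well-formed course code; second grep keeps only the code itself.
grep -oE '"GET /courses/cse[0-9]{3}[a-z]?\b' "$logfile" \
    | grep -oE 'cse[0-9]{3}[a-z]?$' \
    | sort | uniq -c | sort -nr | head -n 3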
Step 6: Handle optional argument
By default, we’ve been assuming that one argument is passed to the script. Let’s handle two other cases: when no arguments are provided, and when the user provides an extra argument specifying how many crawlers and courses to output.
If the user accidentally forgets to pass a common_log file, report the problem and exit with code 1. The following output should appear when running ./analyzer.sh:
Usage: ./analyzer.sh LOGFILE [NUM_TO_SHOW]
If the user passes an additional argument to the script, we'll interpret that as the number of crawlers and courses to show. Assume the NUM_TO_SHOW is a non-negative integer. If there are fewer unique crawlers (or courses) than NUM_TO_SHOW, just output all the crawlers (or courses). Remember to update the header lines, Top X crawlers and Top X most-crawled courses.
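A minimal sketch of the argument handling, assuming 3 as the default when no NUM_TO_SHOW is given (matching the earlier examples):

if [[ $# -lt 1 ]]; then
    echo "Usage: ./analyzer.sh LOGFILE [NUM_TO_SHOW]"
    exit 1
fi
logfile="$1"
num_to_show="${2:-3}"    # ${2:-3} falls back to 3 if $2 is unset
echo "Top ${num_to_show} crawlers:"

Conveniently, head -n "$num_to_show" already prints everything when its input has fewer lines than requested, so the fewer-than-NUM_TO_SHOW case needs no extra handling.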