Note: this is for the Summer 2024 iteration of CSE 391. Looking for a different quarter? Please visit https://courses.cs.washington.edu/courses/cse391/.
Due Friday 08/16 at 11:59 pm. No late submissions accepted.
Submission: Gradescope
Specification: Spec
This assignment is focused on giving you more practice with Bash shell scripting. In particular, you’ll process a large amount of data (the common_log from HW6) using regular expressions and fundamental programming concepts in Bash. You’ll also learn a bit more about the common_log dataset (and how frequently bots are pinging the CSE web servers)!
To calculate your final score on this assignment, sum the individual scores from Gradescope.
- if your score is between [0, 1), you will get 0 points
- if your score is between [1.0, 1.5), you will get 1 point
- if your score is between [1.5, 2], you will get 2 points
Assignment Overview¶
Similar to HW6 Task 3, this assignment will look at IP-anonymized snapshots of the CSE course web logs. In particular, you will write and turn in to Gradescope an analyzer.sh file that takes in a mandatory argument (the path to a log file) and an optional argument to produce some summary statistics about the data in the file.
After finishing all of the steps in the spec (“Main Program” and “Arguments”), running the script with the common_log file from hw9:
$ ./analyzer.sh common_log
would produce the following output:
Processing 27545 entries...
Time Range: [07/Aug/2024:18:00:00 -0700] - [07/Aug/2024:18:59:59 -0700]

19962/27545 (72%) requests succeeded (200 OKs)

2990/27545 (10%) requests were made by crawlers
14 unique crawlers were detected

Top 3 crawlers:
1116 http://www.google.com/bot.html
601 http://www.bing.com/bingbot.htm
378 http://www.semrush.com/bot.html

Top 3 most-crawled courses:
6290 cse163
1056 cse333
923 cse331
As another example, assume that you have the common_log file from hw6. Running the script with the optional number-to-show argument:
$ ./analyzer.sh common_log 5
would produce the following output:
Processing 42648 entries...
Time Range: [22/Jul/2024:14:00:00 -0700] - [22/Jul/2024:17:59:59 -0700]

30373/42648 (71%) requests succeeded (200 OKs)

14596/42648 (34%) requests were made by crawlers
17 unique crawlers were detected

Top 5 crawlers:
4012 http://www.bing.com/bingbot.htm
3871 https://developer.amazon.com/support/amazonbot
1435 http://www.google.com/bot.html
1375 http://www.semrush.com/bot.html
1197 http://ahrefs.com/robot/

Top 5 most-crawled courses:
3251 cse163
2568 cse190m
1603 cse373
1599 cse544
1581 cse351
Data Setup¶
When developing your solution, you’ll want to test it on some data. We’ll provide you two datasets, both taken from the real common_log file on the CSE web server.
The first is the common_log from 6:00-6:59 PM on August 7th:
wget https://courses.cs.washington.edu/courses/cse391/24su/homework/hw9/common_log.zip
unzip common_log.zip
Another data source is available from HW6. If you haven’t already, download the files from HW6 to get the common_log file from 2:00-5:59 PM on July 22nd:
wget https://courses.cs.washington.edu/courses/cse391/24su/homework/hw6/hw6.zip
unzip hw6.zip
common_log format¶
Info
This is the same format as HW6; it’s copied here (and slightly expanded upon) for your convenience.
Each line in the file common_log represents one request to a webpage on the CSE server. Roughly, its format looks like this:
[TIMESTAMP] [REQUEST] [STATUS] [SIZE] [REFERRER] [USER AGENT] [SERVER] - [N]
Generally, each [] item is separated by a space (and values with a space in them are quoted). The [SIZE] and [REFERRER] can be - or "-" when the field is missing.
You won’t have to worry about most of these fields; for this task, we will focus on the [TIMESTAMP], [REQUEST], [STATUS], and [USER AGENT].
As an example, consider the following line (indented for clarity):
[22/Jul/2024:14:12:33 -0700]
    "GET /courses/cse391/24su/css/base.css HTTP/1.1" 200 159760
    "https://courses.cs.washington.edu/courses/cse391/24su/"
    "Mozilla/5.0 ... Safari/605.1.15"
    courses.cs.washington.edu:443 - 6
- the [TIMESTAMP] is [22/Jul/2024:14:12:33 -0700]
- the [REQUEST] is "GET /courses/cse391/24su/css/base.css HTTP/1.1"
- the [STATUS] is 200
- the [USER AGENT] is "Mozilla/5.0 ... Safari/605.1.15"
You may additionally assume the following things, which are true in both of the provided datasets:
- a [TIMESTAMP] always starts with [ and ends with ]
- a [REQUEST] or [USER AGENT] is always enclosed with double quotes (")
- a [STATUS] is always three digits
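To see how these assumptions can be used, here is a minimal sketch that pulls the [TIMESTAMP] and [STATUS] out of the example line above. The helper names get_timestamp and get_status are hypothetical, and the patterns assume that the request itself contains no double quotes. This is one possible approach, not the expected solution.

```shell
# The example entry from above, joined back onto a single line.
line='[22/Jul/2024:14:12:33 -0700] "GET /courses/cse391/24su/css/base.css HTTP/1.1" 200 159760 "https://courses.cs.washington.edu/courses/cse391/24su/" "Mozilla/5.0 ... Safari/605.1.15" courses.cs.washington.edu:443 - 6'

# Hypothetical helper: print the leading [ ... ] timestamp of each input line.
get_timestamp() {
    grep -oE '^\[[^]]*\]'
}

# Hypothetical helper: print the three-digit status that follows the quoted
# request (assumes the request contains no double quotes).
get_status() {
    sed -E 's/^\[[^]]*\] "[^"]*" ([0-9]{3}) .*$/\1/'
}

printf '%s\n' "$line" | get_timestamp   # [22/Jul/2024:14:12:33 -0700]
printf '%s\n' "$line" | get_status      # 200
```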
Main Program¶
We’ll now describe the steps to produce the main output of the program, broken into distinct parts. We suggest that you implement each part before moving on to the next one. Each part should be outputted sequentially, separated by an extra newline. By the end of this section, your program output should look like this:
$ ./analyzer.sh common_log
Processing 27545 entries...
Time Range: [07/Aug/2024:18:00:00 -0700] - [07/Aug/2024:18:59:59 -0700]

19962/27545 (72%) requests succeeded (200 OKs)

2990/27545 (10%) requests were made by crawlers
14 unique crawlers were detected

Top 3 crawlers:
1116 http://www.google.com/bot.html
601 http://www.bing.com/bingbot.htm
378 http://www.semrush.com/bot.html

Top 3 most-crawled courses:
6290 cse163
1056 cse333
923 cse331
For now, we will assume that analyzer.sh is called with exactly one argument: the path to a common_log file. You may safely assume that if an argument is provided, it is a valid common_log file; you do not need to check this.
Header (Warmup)¶
First, we just want to get some basic information about the common_log we’re analyzing.
In particular, have your script output the following details in two separate lines:
- the number of lines in the file
- the [TIMESTAMP] of the first and last lines in the file
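As a rough sketch of how these two details could be gathered, the function below combines wc, head, tail, and grep. The name print_header and the two demo log lines are made up for illustration; your own script may be organized quite differently.

```shell
# Hypothetical helper: print the two header lines for the log file given as $1.
print_header() {
    logfile="$1"
    # Redirecting into wc keeps the filename out of its output; the arithmetic
    # expansion drops any padding that some versions of wc add.
    total=$(( $(wc -l < "$logfile") ))
    first=$(head -n 1 "$logfile" | grep -oE '^\[[^]]*\]')
    last=$(tail -n 1 "$logfile" | grep -oE '^\[[^]]*\]')
    echo "Processing $total entries..."
    echo "Time Range: $first - $last"
}

# Demo on a tiny, made-up two-line log.
demo=$(mktemp)
printf '%s\n' \
    '[07/Aug/2024:18:00:00 -0700] "GET /a HTTP/1.1" 200 5 "-" "agent" host:443 - 1' \
    '[07/Aug/2024:18:00:09 -0700] "GET /b HTTP/1.1" 404 5 "-" "agent" host:443 - 1' \
    > "$demo"
print_header "$demo"
rm -f "$demo"
```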
For example, the header for the common_log provided in HW6 should be:
Processing 42648 entries...
Time Range: [22/Jul/2024:14:00:00 -0700] - [22/Jul/2024:17:59:59 -0700]
and the header for the common_log provided in HW9 should be:
Processing 27545 entries...
Time Range: [07/Aug/2024:18:00:00 -0700] - [07/Aug/2024:18:59:59 -0700]
Your analyzer.sh should output exactly the above lines for the two provided common_log files, including capitalization, spelling, and whitespace.
Info
📬 This question is graded by running your analyzer.sh on the two common_log files and looking at the exact output of your first two lines.
200 OKs¶
Next, you are interested in the number of requests that are successful, which you’ll estimate by looking for the number of lines that have a [STATUS] of exactly 200.
Output the total number of requests with a 200 status code, followed by the total number of requests, and then the percentage of requests with a 200 status code. For the purpose of this problem, you should calculate the percentage with the exact formula of 200_requests * 100 / total_requests.
Info
Why this formula? By default, bash doesn’t support floating-point division. This calculates the percentage using integer division instead.
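To make the effect of integer division concrete, here is a small sketch that plugs the HW6 numbers into the formula. The variable names are just for illustration.

```shell
ok=30373      # requests with a 200 status in the HW6 sample output
total=42648   # total entries in the HW6 sample output

# Multiply first: 3037300 / 42648 truncates to 71.
percent=$(( ok * 100 / total ))

# Divide first: 30373 / 42648 truncates to 0, so the result stays 0.
wrong=$(( ok / total * 100 ))

echo "$ok/$total ($percent%) requests succeeded (200 OKs)"
echo "dividing first would give $wrong%"
```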
For example, the output for the common_log provided in HW6 should be:
30373/42648 (71%) requests succeeded (200 OKs)
and the output for the common_log provided in HW9 should be:
19962/27545 (72%) requests succeeded (200 OKs)
Your analyzer.sh should output exactly the above lines for the two provided common_log files, including capitalization, spelling, and whitespace.
Info
📬 This question is graded by running your analyzer.sh on the two common_log files and looking at the exact output of the fourth line.
Bot Analysis, Part 1¶
Switching gears, we’ll once again look at crawlers. We’ll generate the same summary statistics that we calculated in HW6, but script this (rather than having to run multiple separate commands).
Add a feature to your script to output two lines:
- the number and percentage of requests made by a crawler
- the number of unique crawlers detected
Some implementation notes:
- as a reminder, we said that a request was made by a crawler if its [USER AGENT] contains a string starting with +http and ending with (the first possible) ), where the text bot appears between the +http and )
- for the purposes of this problem, a crawler is unique if the content between the +http and the first ) is unique, even if the resulting text isn’t strictly a URL
- to calculate the percentage score, use the same formula as shown above (i.e. by multiplying by 100 first and then using integer division)
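The crawler rule above can be written as a single extended regular expression. The sketch below is one possible reading of it, not the official pattern: it matches anywhere on a line rather than only inside the [USER AGENT], and the helper names and sample user agents are chosen for illustration.

```shell
# "+http", then text without ")" that contains "bot", then the first ")".
# Stored in a variable so the same pattern can be reused.
crawler_re='\+http[^)]*bot[^)]*\)'

# Hypothetical helper: count input lines that contain a crawler string.
count_crawler_requests() {
    grep -cE "$crawler_re"
}

# Hypothetical helper: count the distinct crawler strings in the input.
count_unique_crawlers() {
    grep -oE "$crawler_re" | sort -u | wc -l
}

# Four sample user agents: two crawlers (one seen twice) and one browser.
sample='"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
"Mozilla/5.0 ... Safari/605.1.15"'

printf '%s\n' "$sample" | count_crawler_requests   # 3
printf '%s\n' "$sample" | count_unique_crawlers    # 2
```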
For example, the output for the common_log provided in HW6 should be:
14596/42648 (34%) requests were made by crawlers
17 unique crawlers were detected
and the output for the common_log provided in HW9 should be:
2990/27545 (10%) requests were made by crawlers
14 unique crawlers were detected
Your analyzer.sh should output exactly the above lines for the two provided common_log files, including capitalization, spelling, and whitespace.
Info
📬 This question is graded by running your analyzer.sh on the two common_log files and looking at the exact output of lines 6 and 7.
Bot Analysis, Part 2¶
Now, you want to do some more research on the bots that are pinging our server (and maybe send their owners a bill for all the bandwidth they’re using)! To do that, you’ll need to know which bots are inducing the most load.
Add a feature to your script to output the “URLs” for the top 3 most common bots, sorted in descending order. For the purposes of this problem, the “URL” is the text between (but not including) the + and ).
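One common way to rank items by frequency is a sort, count, sort pipeline. The sketch below shows that idea on made-up input; the helper names crawler_urls and top_n are hypothetical, the pattern is only one reading of the crawler rule, and we make no claim that this is the exact command the graders expect.

```shell
crawler_re='\+http[^)]*bot[^)]*\)'

# Hypothetical helper: print each crawler "URL" without the leading + and
# the trailing ).
crawler_urls() {
    grep -oE "$crawler_re" | sed -E 's/^\+(.*)\)$/\1/'
}

# Hypothetical helper: read one item per line and print the $1 most frequent
# items, most frequent first, each prefixed by its count.
top_n() {
    sort | uniq -c | sort -rn | head -n "$1"
}

# Made-up input: three Google hits, two Bing hits, one Semrush hit.
sample='(compatible; Googlebot/2.1; +http://www.google.com/bot.html)
(compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
(compatible; Googlebot/2.1; +http://www.google.com/bot.html)
(compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)
(compatible; Googlebot/2.1; +http://www.google.com/bot.html)
(compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)'

# Prints the Google URL (count 3) and then the Bing URL (count 2).
printf '%s\n' "$sample" | crawler_urls | top_n 2
```

Items with equal counts may come out in either order, so a tie in real data could need extra care.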
For example, the output for the common_log provided in HW6 should be:
Top 3 crawlers:
4012 http://www.bing.com/bingbot.htm
3871 https://developer.amazon.com/support/amazonbot
1435 http://www.google.com/bot.html
and the output for the common_log provided in HW9 should be:
Top 3 crawlers:
1116 http://www.google.com/bot.html
601 http://www.bing.com/bingbot.htm
378 http://www.semrush.com/bot.html
Pay special attention to the whitespace (which is formatted by the output of a specific command and flag from this class).
Your analyzer.sh should output exactly the above lines for the two provided common_log files, including capitalization, spelling, and whitespace.
Info
📬 This question is graded by running your analyzer.sh on the two common_log files and looking at the exact output of lines 9-12.
Course Analysis¶
Finally, we’re curious which classes are most frequently pinged by crawlers. We should congratulate the instructors for teaching a popular class!
Add a feature to your script that outputs the top 3 most frequently appearing course codes. To determine the course code for a request, you should follow these exact steps:
- find all entries that are GET requests.
  - For this problem, this is an entry where the [REQUEST] field starts with GET.
- next, examine the path.
  - For this problem, this is the string that immediately follows the GET (and a space), starting with a / and ending at (but not including) the first following space.
- a path is tied to a course if (and only if):
  - the path starts with /courses/.
  - then, immediately has a course code, which starts with cse, then has three digits, and then optionally one letter.
- not all requests have course codes, and some contain course codes but not in the format/location described above. You should only include course codes produced in the exact format above.
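The steps above can be sketched as a single sed substitution. This is only one interpretation: the helper name course_codes is hypothetical, the optional letter is assumed to be lowercase (as in cse190m), and the sample lines are shortened, made-up entries.

```shell
# Hypothetical helper: print the course code for each GET request whose path
# starts with /courses/ followed by cse, three digits, and an optional
# lowercase letter. Lines that do not match are dropped (-n plus the p flag).
course_codes() {
    sed -nE 's#^[^"]*"GET /courses/(cse[0-9]{3}[a-z]?).*$#\1#p'
}

# Shortened, made-up entries: only the first two satisfy the rule.
sample='[t] "GET /courses/cse163/24su/ HTTP/1.1" 200 5
[t] "GET /courses/cse190m/09sp/ HTTP/1.1" 200 5
[t] "POST /courses/cse391/form HTTP/1.1" 200 5
[t] "GET /education/courses/cse391/ HTTP/1.1" 404 5'

printf '%s\n' "$sample" | course_codes
# cse163
# cse190m
```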
For example, the output for the common_log provided in HW6 should be:
Top 3 most-crawled courses:
3251 cse163
2568 cse190m
1603 cse373
and the output for the common_log provided in HW9 should be:
Top 3 most-crawled courses:
6290 cse163
1056 cse333
923 cse331
Pay special attention to the whitespace (which is formatted by the output of a specific command and flag from this class).
Your analyzer.sh should output exactly the above lines for the two provided common_log files, including capitalization, spelling, and whitespace.
Info
📬 This question is graded by running your analyzer.sh on the two common_log files and looking at the exact output of lines 14-17.
Arguments¶
By default, we’ve been assuming that one argument is passed to the script. To wrap things up, we’ll handle two other cases: when no arguments are provided, and when the user provides an extra argument specifying how many crawlers and courses to output.
No Arguments¶
If the user accidentally forgets to pass a common_log file, we shouldn’t do the above “main program”; instead, we should tell the user how to properly use our command and exit with a code of 1.
In particular, your program should have the exact output below when running ./analyzer.sh:
Usage: ./analyzer.sh LOGFILE [NUM_TO_SHOW]
Info
📬 This question is graded by running your analyzer.sh with no arguments and checking both the output and the exit code.
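As a minimal sketch of this check, the function below tests the argument count with $#. The name check_args is hypothetical, and printing the usage line to standard output (rather than standard error) is an assumption. It uses return so that it can be tried in an interactive shell without closing it.

```shell
# Hypothetical helper: print the usage line and return 1 when no arguments
# were given; otherwise return 0.
check_args() {
    if [ "$#" -lt 1 ]; then
        echo "Usage: ./analyzer.sh LOGFILE [NUM_TO_SHOW]"
        return 1
    fi
}

# Near the top of a script, this could be used as:
#   check_args "$@" || exit 1
```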
Optional Argument: NUM_TO_SHOW¶
If the user passes an additional argument to the script, we’ll interpret that as the number of crawlers and courses to show. This should change the number of crawlers and courses you output in “Bot Analysis, Part 2” and “Course Analysis”.
For example, the output for the common_log provided in HW6 with ./analyzer.sh common_log 5 should be:
Processing 42648 entries...
Time Range: [22/Jul/2024:14:00:00 -0700] - [22/Jul/2024:17:59:59 -0700]

30373/42648 (71%) requests succeeded (200 OKs)

14596/42648 (34%) requests were made by crawlers
17 unique crawlers were detected

Top 5 crawlers:
4012 http://www.bing.com/bingbot.htm
3871 https://developer.amazon.com/support/amazonbot
1435 http://www.google.com/bot.html
1375 http://www.semrush.com/bot.html
1197 http://ahrefs.com/robot/

Top 5 most-crawled courses:
3251 cse163
2568 cse190m
1603 cse373
1599 cse544
1581 cse351
Note that the lines say Top 5 crawlers and Top 5 most-crawled courses instead of Top 3 crawlers and Top 3 most-crawled courses. The rest of the output should not change.
You may assume that the user passes in a non-negative number. If there are fewer unique crawlers (or courses) than NUM_TO_SHOW, just output all of the crawlers (or courses).
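One way to handle an optional argument is Bash's default-value expansion. The sketch below assumes a hypothetical helper named num_to_show that receives the script's arguments; it falls back to 3 when the second argument is missing or empty.

```shell
# Hypothetical helper: echo how many items to show for the given script
# arguments, defaulting to 3 when the second one is missing or empty.
num_to_show() {
    echo "${2:-3}"
}

num_to_show common_log      # 3
num_to_show common_log 5    # 5

# The value can then be reused in both the heading and the pipeline.
n=$(num_to_show common_log 5)
echo "Top $n crawlers:"     # Top 5 crawlers:
```

If the count is later passed to head -n, note that head simply prints every line when fewer than that many are available.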
Info
📬 This question is graded by running your analyzer.sh with two arguments (on both common_logs), testing 5 and 25 for the second argument, and comparing the entire output.
Tips, Tricks, and Things to Look Out For¶
Some general advice:
- above all, please look at the spec definitions and output format carefully!
- stuck? reach out and ask for help!!
- while you do each part, we suggest frequently debugging with echo statements in the same way that you’d print-debug a Java program. Once you’re confident things work, then format them into what the spec is looking for.
- prioritize correctness over efficiency; running the same command multiple times is completely okay!
- this homework heavily leans on ideas from previous homeworks. If these feel shaky, we definitely suggest reviewing some of the commands we practiced in HW2, 3, 6, and 7.
And, some specific advice:
- you may find yourself writing the same regex multiple times. Since regular expressions are just strings, you can store these as a variable and reuse them!
- when doing arithmetic in Bash, remember that:
  - you need to use special syntax (either let or $(()))
  - the order of operations matters, since / is integer division!
- whenever you use a variable within a string and/or as an argument, we suggest wrapping it in quotes (e.g. "$argument"). This is particularly helpful if the content of your variable has a space in it.
- when using sed to edit text, recall that sed will not edit (i.e. will keep) any text that is not matched by the pattern. If you want to edit an entire line, your pattern needs to match the entire line.
- to print an empty line, just write echo (with no arguments).
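To illustrate the quoting and sed tips, here is a small sketch on made-up values; the names name and count_args are hypothetical.

```shell
name='common log'     # a value containing a space

# Hypothetical helper: print how many arguments it received.
count_args() {
    echo "$#"
}

count_args $name      # 2: unquoted, the value is split on the space
count_args "$name"    # 1: quoted, the value stays a single argument

# sed only replaces the matched text; the rest of the line is kept.
echo 'abc 123 xyz' | sed -E 's/[0-9]+/N/'             # abc N xyz

# Matching the whole line lets the replacement keep only the captured group.
echo 'abc 123 xyz' | sed -E 's/^.*([0-9]{3}).*$/\1/'  # 123
```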
Appendix: Using this on Real Data¶
Warning
This portion is not required for the homework; it’s just for fun!
If you’ve correctly implemented the homework, you can use this on live data from the CSE web logs!
As a reminder, the logs are hosted at /cse/web/courses/logs/common_log. You will need to do a bit of pre-processing:
- you probably won’t be able to run analyzer.sh on the entire common_log file, since it’s quite large
- you’ll want to remove the IPs (which we manually removed when providing you these submission files)
We suggest doing the following:
- first, use tail to get a subset of the log that you’re interested in (100000 seems like a reasonably-sized sample):
$ tail -n 100000 /cse/web/courses/logs/common_log > common_log
- next, remove the portions before the timestamp. You can do this with the following sed command:
$ sed -ri.bak 's/^[^[]*(\[.*)$/\1/' common_log
- finally, run your analyzer.sh:
$ ./analyzer.sh common_log
Processing 100000 entries...
Time Range: [07/Aug/2024:21:07:15 -0700] - [08/Aug/2024:01:57:54 -0700]

70802/100000 (70%) requests succeeded (200 OKs)

12531/100000 (12%) requests were made by crawlers
19 unique crawlers were detected

Top 3 crawlers:
3107 https://openai.com/gptbot
2952 http://www.google.com/bot.html
1758 http://www.semrush.com/bot.html

Top 3 most-crawled courses:
28603 cse163
3332 cse341
3332 cse154
Appendix: Methodology¶
Warning
This portion is not required for the homework; it’s just for fun!
As an aside, there are a few limitations with this analysis:
- the way we look for successful requests is a non-trivial undercount: there are more successful status codes than just 200. If you’re curious, see List of HTTP status codes.
- the way that we look for crawlers is a non-trivial undercount: many bots do not follow the format we described above.
  - For example,
    - some bots do not have URLs that start with http
    - some bots do not include the +
    - some bots do contain a +http and then eventually a ), but preceding the ) are more non-URL characters
    - some bots do not contain the word “bot”
    - some bots completely lie about who they are in the user agent (not much we can do here!)
  - Catching bots (both crawlers and other types of bots) is a real-world problem that many software engineers (and huge companies) work on. Generally speaking, it’s an arms race/cat-and-mouse game; people who write bots are frequently trying to evade detection!
- the way that we count course codes is slightly incorrect, since:
  - not all requests are GET requests
  - some (very old) course website paths don’t exactly follow the /courses/COURSE_CODE format
  - not all CSE course codes follow the format we described above (e.g. CSE 390HA)
We chose to ignore these issues to make this a reasonably-scoped project. But, if you’re interested, you can dive more into the above problems and give them a shot!