Due: Monday, April 27, 2020, at 11:59pm
The main goal of this assignment is to learn more about shell
scripting and using regular expressions and string processing
programs, particularly grep and sed. You
will also learn about accessing files from the web and incorporating
an R script in your work.
Taken in its entirety, this assignment may seem quite long. You are encouraged to read through the described approach, which is designed to break the assignment into manageable pieces. By completing and testing each piece independently you will be successful in the overall project.
In addition to the lecture notes, you may find "The Linux Pocket Guide" a useful reference for completing this assignment.
Online manuals:
In general, whenever you need to use a new tool, you should get into the habit of looking for documentation online. There are usually good tutorials and examples that you can learn from. As you work on this assignment, if you find that you would like more information about a tool (sed, grep, or R), try searching for the name of the tool or the name of the tool followed by keywords such as "tutorial" or "documentation". Also be sure to use the regular Unix documentation (man pages and info command), and experiment with commands to try out various options and see what they do.
Download the file: hw3.tar. Extract all the files for this assignment using the following command:
$ tar -xvf hw3.tar
You should now see a directory called hw3.
If you see it, you are ready to start the assignment. If this
did not work for you, please post a message on the
HW3 discussion describing the problem to see if someone has any
ideas, or contact a TA or the instructor (if you send mail, please use
cse374-staff[at]cs), or talk to another student
in the class.
You are interested in evaluating how 'user-friendly' different CSE courses are. You have a theory that courses with more information on their web pages will be easier to navigate, and you want to test your theory. To achieve this you have decided to plot the sizes of the front page for each course and look for trends in the data. In order to get the most current information you want the front page for 20sp (Spring quarter, 2020).
To get started you use the CSE webpage on which all the current courses are listed. You could look at each course webpage manually, but you decide to write a script to automatically extract webpages, measure their size, and generate the required plot.
We have provided the webpage listing all the courses in html format.
(This is the courses-index.html file provided in the assignment
setup.) To run an
experiment automatically on each URL in this list, we need to extract
the URLs and write them into a text file. There are several ways in
which this can be done, and different utilities
(sed, grep) can help.
You must use grep
and/or sed even if you know other programs or languages
(awk, perl, python, ...)
that could do similar things in different ways. But it's fine to
use egrep and extended regular expressions
in sed and grep if you wish.
In a file called getcourses, write a script that
extracts the URLs and writes them into a text file. The script should
take two arguments: the name of the output file for results and the
name of the input html file.
For example, executing:
$ ./getcourses courselist courses-index.html
Should write content similar to the following into courselist:
http://courses.cs.washington.edu/courses/cse120/20sp/
http://courses.cs.washington.edu/courses/cse131/20sp/
http://courses.cs.washington.edu/courses/cse142/20sp/
...
If the user provides fewer than 2 arguments, the script should print an error message and exit with a return code of 1.
If the text file provided as argument (for the output) exists, the script should simply overwrite it without any warning.
If the html file provided as argument (for the input) does not exist, the script should print an appropriate error message and exit with a return code of 1.
If the script does not report any errors, it should exit with a return code of 0.
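As an illustration, here is a sketch of such a grep/sed pipeline run on a made-up sample file. The sample HTML below is hypothetical; the real markup in courses-index.html will differ, so the pattern will need adjusting:

```shell
# Hypothetical sample of what rows in courses-index.html might look like:
cat > sample.html <<'EOF'
<tr><td><a href="http://courses.cs.washington.edu/courses/cse142/20sp/">CSE 142</a></td></tr>
<tr><td><a href="http://courses.cs.washington.edu/courses/cse374/20sp/">CSE 374</a></td></tr>
EOF
# Step 1: keep only lines containing http.
# Steps 2 and 3: strip everything before and after the URL itself.  Note
# the pattern matches the URL (http://... up to the closing quote), not
# just .* anchored to surrounding text.
grep 'http' sample.html |
  sed 's/.*\(http:\/\/[^"]*\).*/\1/' > courselist
cat courselist
```

With this sample input, courselist ends up containing one URL per line.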
Hints: step-by-step instructions
1. Use grep to find all the lines that contain the string http. Test if it works before proceeding to step 2.
2. Use sed to replace everything that precedes the URL with the empty string. Test if it works before proceeding to step 3. For this assignment, your sed command(s) must match the http://... URL strings themselves, not surrounding text in the table. (I.e., your sed command must use a pattern that matches the URLs, although, of course, it probably will contain more than that if needed to isolate the URL strings. But it can't just be .* surrounded by patterns that match whatever appears before and after the URLs in this particular data file.)
3. Use sed to replace everything after the URL as well. Test if everything works: does your final list of URLs look correct, or do you need different replacement text?

In a file called perform-measurement.sh, write a bash
script that takes a URL as an argument and outputs the size of the
corresponding page in bytes.
For example, executing your script with the URL of homework 1 on the class website as argument:
$ ./perform-measurement.sh http://courses.cs.washington.edu/courses/cse374/20sp/assignments/
should output only 5597 to standard output:
5597
(This number was correct at the time this assignment was prepared, but might be somewhat different if the page is modified some time in the future.)
If the user does not provide any arguments, the script should print an appropriate error message and exit with a return code of 1.
If the user provides an erroneous argument or if downloading the requested page fails for any other reason, the script should simply print the number "0" (zero). In this case, or if the page is downloaded successfully, the script should exit with a return code of 0 after printing the number to standard output.
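The core measurement logic might look like the following sketch. To keep it self-contained, the wget download is faked with a local file; in your script the temporary file would instead be filled by something like wget -q -O "$tmp" "$url":

```shell
# Sketch of the size-measurement logic.  The "download" is faked with a
# local file so the sketch runs standalone; a real script would fill the
# temporary file with wget instead.
tmp=$(mktemp /tmp/measure.XXXXXX)
printf 'pretend page content\n' > "$tmp"   # stand-in for a wget download
if [ -s "$tmp" ]; then
    wc -c < "$tmp"     # byte count only; 'wc file' would also print the name
else
    echo 0             # download failed or empty: report size 0
fi
rm -f "$tmp"           # always clean up the temporary file
```

Redirecting the file into wc (rather than naming it as an argument) is what keeps the output to just the number.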
Hints:
- The wget program downloads files from the web. Use man wget to see its options.
- The mktemp program produces unique file names for temporary files. If you create a temporary file, you should remove it before your script exits. Generally it is best to create temporary files like this in /tmp.
- The wc program counts bytes, words, and lines. Note the difference between the output of wc a-test-file and wc < a-test-file.
- Unwanted output can be discarded by redirecting it to /dev/null. For example, try ls > /dev/null.

To perform the experiment, you need to execute the
script perform-measurement.sh on each URL inside the
file courselist. Once again, you would like to do this
automatically with a script.
In a file called run-analysis, write a shell
script that:
- Executes perform-measurement.sh on each URL in the URL file.
- For each URL, writes a line to the results file with the following output, separated by spaces: course-number page-size. The course-number is the three-digit course number. You can extract this from the URL you give to perform-measurement.sh using a method similar to the one you used to parse the original course listings. You will want only the three-digit numerals, not any letter section extensions. The page-size is the result of perform-measurement.sh.
- Before executing perform-measurement.sh on a URL, prints the following message: "Performing byte-size measurement on <URL>...".
- After perform-measurement.sh produces a value: if the value is greater than zero, the script should output the message "...successful". If the value is zero, this means some error has occurred, and the script should output the message "...failure".
- When run-analysis finishes, it should exit with a return code of 0.

To debug your script, instead of trying it directly
on the full courselist, we provide you with a smaller
file: popular-small.txt. You should execute your script
on popular-small.txt until it works, and then run it on courselist.
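The heart of run-analysis can be sketched as the loop below. To keep the sketch self-contained it creates a stub perform-measurement.sh and a tiny courselist of its own; the sed pattern for the course number is an assumption about the URL layout (a /cseNNN/ path component), and the real script would take the file names from its arguments:

```shell
# Stub perform-measurement.sh so this sketch runs standalone: it reports
# size 0 for "bad" URLs and a fixed size otherwise.  Your real script
# replaces this.
cat > perform-measurement.sh <<'EOF'
#!/bin/sh
case "$1" in *bad*) echo 0 ;; *) echo 1234 ;; esac
EOF
chmod +x perform-measurement.sh
printf '%s\n' \
  'http://courses.cs.washington.edu/courses/cse142/20sp/' \
  'http://courses.cs.washington.edu/bad-url/' > courselist

> dataout                                  # start with an empty results file
while read -r url; do
    echo "Performing byte-size measurement on $url..."
    size=$(./perform-measurement.sh "$url")
    if [ "$size" -gt 0 ]; then
        echo "...successful"
        # Assumed URL layout: the course number follows "cse" in the path.
        course=$(echo "$url" | sed 's|.*/cse\([0-9][0-9][0-9]\)/.*|\1|')
        echo "$course $size" >> dataout
    else
        echo "...failure"
    fi
done < courselist
```

After the loop, dataout holds one "course-number page-size" line per successful measurement.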
Executing your script as follows:
$ ./run-analysis dataout popular-small.txt
Should produce output similar to the following:
Performing byte-size measurement on http://courses.cs.washington.edu/courses/cse374/20sp/
...successful
Performing byte-size measurement on http://courses.cs.washington.edu/i.will.return.an.error...
...failure
Performing byte-size measurement on http://courses.cs.washington.edu/courses/cse374/20sp/assignments/hw3.html
...successful
And the content of dataout should be similar
to the ones below. Note that the exact values will change as we edit the
class website!
374 3803
374 19098

It is hard to understand the results just by looking at a list of numbers, so you would like to produce a graph. More specifically, you would like to produce a scatterplot, where the x-axis will show the course number and the y-axis will show the size of the index page.
Luckily, you've used R for some of your statistics courses, so you find
a script called scatterplot.R. Note that this script
expects your experimental results to be stored in a file
called dataout.
You will produce a plot by calling the R script at the command line. You can
make sure your system has R installed by typing R --version at your
command prompt. If it is not installed, you may need to
install it.
The script should produce a file
called scatterplot_out.jpg. You can view this file with any
image viewer.
You will want to submit one more file, which should be called hw3summary. This file should contain, at minimum, a note that you tested your scripts with bash on either of our reference systems (klaatu or the current CSE Linux virtual machine).

Please submit your files to the Canvas HW3 Assignment. You will need to submit getcourses, perform-measurement.sh, run-analysis, scatterplot_out.jpg, and hw3summary.
You should combine your files into an archive (see the tar command) and turn that in as a single file.
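For example, the archive could be built as follows. The archive name hw3-submit.tar is just a suggestion, and the touch line only creates empty stand-ins so the sketch runs on its own; with your real files in the current directory, omit it:

```shell
# Bundle the solution files into a single archive for submission.
# The touch line creates empty stand-ins so this sketch is runnable
# standalone; drop it when archiving your real files.
touch getcourses perform-measurement.sh run-analysis scatterplot_out.jpg hw3summary
tar -cvf hw3-submit.tar \
    getcourses perform-measurement.sh run-analysis scatterplot_out.jpg hw3summary
tar -tf hw3-submit.tar      # list the archive contents to double-check
```

Listing the archive with tar -tf before uploading is a quick way to confirm nothing was left out.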
The drop box will allow you to turn in your homework up to four days late, but remember that you will be penalized 20% for each additional day you take.