CSE 374 16wi :: Homework 2
Due: Thursday, Jan. 21, 2016, at 23.00.
Assignment goal
The main goal of this assignment is to get more practice writing shell scripts and using regular expressions and string processing programs, particularly grep and sed. You will also learn about accessing files from the web and a little more about using gnuplot.
Getting Ready
Please use the discussion board for this assignment to post questions and help each other out.
Documentation
In addition to the lecture slides, you may find "The Linux
Pocket Guide" a useful reference for completing this assignment
(in particular, pages 195 and following [pp. 166 and later in the
first edition]).
Online manuals:
In general, whenever you need to use a new tool, you should get
into the habit of looking for documentation online. There are
usually good tutorials and examples that you can learn from. As you
work on this assignment, if you find that you would like more
information about a tool (sed, grep, or gnuplot), try searching for
the name of the tool or the name of the tool followed by keywords
such as "tutorial" or "documentation". Also be
sure to use the regular Unix documentation (man pages and info
command), and experiment with commands to try out various options
and see what they do.
Data Files
Download the file: hw2.tar.
Extract all the files for this assignment using the following command:
> tar -xvf hw2.tar
You should now see a directory called hw2.
Background
Because you did very well in CSE 374, you were offered a summer
position in a research group. As part of your summer job, you
would like to run the following experiment. Given a list of 100
popular websites, you would like to measure the sizes of their
index pages (the first index.html file that the browser downloads when you visit the site).
We provide you the list of popular websites in the file popular.html (this particular list was taken from 100bestwebsites.org a while back, but even though it is a bit old, it will serve our purposes for this assignment).
One approach would be to manually download each index page in the
list, and use wc
to compute its size in bytes. You
could then manually put all the results in a table and plot some
graphs to display the results. This approach would be long,
tedious, and error prone. It would also be painful if you wanted
to repeat the experiment on the 100 least popular
websites. Instead, you decide to automate the experiment by
writing a set of scripts.
Download a page and compute its size
Specification
Write a script perform-measurement.sh with the following features:
- Takes a single URL as an argument
- Downloads the file at that URL to a temporary file
- Prints to stdout the size of that file in bytes (and
nothing else)
- Removes any temporary files it creates
- If no arguments are provided, prints an error message to stderr and exits
- If the argument is invalid, or downloading fails for any
other reason, prints "0" to stdout
For example, executing your script with the URL of homework 1 on the class website as the argument:
./perform-measurement.sh http://courses.cs.washington.edu/courses/cse374/16wi/hws/hw1.html
should output only:
15369
(This number was correct at the time this assignment was prepared,
but might be somewhat different if the page is modified some time in
the future.)
Implementation Advice
Downloading the URL: You can use the wget program to download the file at any URL. Simply run wget URL. Use man wget to see its options, in particular the options to tell wget to output to a specific file, to suppress output, to set the timeout (how long wget will wait for a URL to load before giving up), and to set the number of times wget will try to download a URL. To check whether a download succeeded, use $? to get the exit status of the most recent command. Remember that 0 indicates success, and anything else (positive or negative) indicates failure of some kind.
Temporary files: The mktemp program produces unique file names. When run, mktemp both creates a temporary file in the /tmp system folder and prints the name of the file to stdout. Thus, you can handle the temp file something like this:
tmp_out=$(mktemp)
...
do something with $tmp_out
...
rm -f $tmp_out # use -f to prevent asking the user whether they're sure
Counting bytes: wc will do the job here. Consult the man page for the appropriate option. Also, compare the output of wc a-test-file and wc < a-test-file.
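Putting these pieces together, here is a minimal sketch of one way perform-measurement.sh could be organized. It is only an outline: the timeout and retry values are placeholders, and you should check man wget yourself to choose the options you actually want.
#!/bin/bash
# Sketch of perform-measurement.sh; option values below are placeholders.
if [ $# -ne 1 ]; then
    echo "usage: $0 URL" 1>&2          # no argument: report the error on stderr
    exit 1
fi
tmp_out=$(mktemp)                      # unique temporary file for the download
# -q suppresses wget's progress output, -O writes the page to the temp file,
# and --timeout/--tries keep wget from hanging forever on an unreachable site.
wget -q -O "$tmp_out" --timeout=10 --tries=1 "$1"
if [ $? -ne 0 ]; then                  # non-zero exit status: the download failed
    echo 0
else
    wc -c < "$tmp_out"                 # redirecting keeps wc from printing the file name
fi
rm -f "$tmp_out"                       # remove the temporary file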
Parsing the HTML list of websites
The list of popular websites is in HTML format. To run an experiment automatically on each URL in this list, we need to extract the URLs and write them into a text file. There are several ways in which this can be done, and different utilities (sed, grep) can help.
Specification
Write a script parse.sh with the following features:
- Takes two arguments: the name of the input HTML file and the name of the output file for the results.
- Writes a list of the URLs extracted from the first argument to the file given in the second argument, one per line.
- Uses grep and/or sed, even if you know other programs or languages (awk, perl, python, ...) that could do similar things in different ways.
- If fewer than two arguments are provided, prints an error message to stderr and exits.
- If the input HTML file argument does not exist, prints an error message to stderr and exits.
- If the output file argument already exists, overwrites it without warning.
For example, executing:
parse.sh popular.html popular.txt
should write content similar to the following into popular.txt:
http://www.yahoo.com/
http://www.google.com/
http://www.amazon.com/
...
Implementation Advice
Step-by-step:
1. Use grep to find all the lines that contain the string http. Test that it works before proceeding to step 2.
2. Use sed to replace everything that precedes the URL with the empty string. Test that it works before proceeding to step 3. Your sed command(s) must match the http://... URL strings themselves, not surrounding text in the table. (That is, your sed command must use a pattern that matches the URLs, although of course it may contain more than that if needed to isolate the URL strings. It can't just be .* surrounded by patterns that match whatever appears before and after the URLs in this particular data file.)
3. Use sed to replace everything after the URL with the empty string as well. Test that everything works (see the sketch below).
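As a rough illustration, here is one way these steps might fit together. The sed pattern below assumes the URLs appear inside href="..." attributes and combines the prefix and suffix removal into a single substitution; the pattern you actually need depends on the markup in popular.html, so treat it as a placeholder.
#!/bin/bash
# Sketch of parse.sh; the sed pattern is a placeholder, not the final answer.
if [ $# -lt 2 ]; then
    echo "usage: $0 input.html output.txt" 1>&2
    exit 1
fi
if [ ! -f "$1" ]; then
    echo "error: input file '$1' does not exist" 1>&2
    exit 1
fi
# Step 1: keep only the lines that mention http.
# Steps 2 and 3: strip everything before and after the URL itself.
grep 'http' "$1" \
    | sed 's|.*\(http://[^"]*\)".*|\1|' \
    > "$2"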
Writeup
Please address the following questions in a readme file you turn in with your code:
- popular.txt might not contain exactly 100 URLs. Is it a bug in your script or a bug in the data? Explain how you chose to handle any extra URLs.
- There are some URLs at the beginning and at the end of the file (such as http://www.100bestwebsites.org/criteria) that are not actually part of the list of 100 best web sites. It's up to you to figure out a reasonable way to handle this so they don't get included in the output - either by removing them somehow (by hand? with some sort of script?), or leaving them in and figuring out how best to deal with them. Explain what you did and why. This shouldn't be extensive or long-winded, just a few sentences about what the problem was and how you dealt with it.
Running the experiment
To perform the experiment, you need to execute the script perform-measurement.sh on each URL inside the file popular.txt. Once again, you would like to do this automatically with a script.
Specification
Write a script run-experiment.sh with the following features:
- Takes a file with a list of URLs as its first argument and executes perform-measurement.sh on each URL in the file.
- For each URL, run-experiment.sh should produce the following output, separated by spaces:
rank URL page-size
The rank of a page is the line number of the corresponding URL in popular.txt (or whatever the input file containing the URLs is named). The URL on the first line of the file has rank 1, the URL on the second line has rank 2, and so on until the last URL/line in the file. The URL is the same string as the argument you gave to perform-measurement.sh. The page-size is the result of perform-measurement.sh.
run-experiment.sh should write its output to a file. The name of that file should be given by the user as the second argument.
- If perform-measurement.sh returns zero for a URL, run-experiment.sh should not write any output to the file for that URL.
- Because it can take a long time for the experiment to finish, your script should provide feedback to the user. The feedback should indicate the progress of the experiment.
- Before executing perform-measurement.sh on a URL, your script should print the following message: "Performing measurement on <URL>...".
- Once perform-measurement.sh produces a value, if the value is greater than zero, the script should output the following message: "...success". If the value is zero, this means some error has occurred, and the script should output the following message: "...failed". (One possible structure for this loop is sketched below.)
To debug your script, instead of trying it directly on popular.txt, we provide you with a smaller file: popular-small.txt. You should execute your script on popular-small.txt until it works. Only then try it on popular.txt.
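Here is a minimal sketch of one way run-experiment.sh might be structured. It assumes perform-measurement.sh is in the current directory (adjust the path if not) and is meant as an outline rather than a finished solution:
#!/bin/bash
# Sketch of run-experiment.sh; assumes ./perform-measurement.sh exists.
if [ $# -lt 2 ]; then
    echo "usage: $0 url-list output-file" 1>&2
    exit 1
fi
rank=0
> "$2"                                 # start with an empty output file
while read -r url; do
    rank=$((rank + 1))                 # rank = line number of the URL in the input file
    echo "Performing measurement on $url..."
    size=$(./perform-measurement.sh "$url")
    if [ "$size" -gt 0 ]; then
        echo "...success"
        echo "$rank $url $size" >> "$2"
    else
        echo "...failed"
    fi
done < "$1"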
Executing your script as follows:
./run-experiment.sh popular-small.txt results-small.txt
should produce output similar to the following:
Performing measurement on http://courses.cs.washington.edu/courses/cse374/16wi/...
...success
Performing measurement on http://i.will.return.an.error...
...failed
Performing measurement on http://courses.cs.washington.edu/courses/cse374/16wi/calendar/calendar.html...
...success
The content of results-small.txt should be similar to that given below. Note that the exact values will change as we edit the class website!
1 http://courses.cs.washington.edu/courses/cse374/16wi/ 14795
3 http://courses.cs.washington.edu/courses/cse374/16wi/calendar/calendar.html 19272
As another example, after executing your script as follows:
run-experiment.sh popular.txt results.txt
the file results.txt should contain results somewhat like the ones shown below (when you run your experiment, the exact values will likely differ):
...
3 http://www.amazon.com/ 328011
4 http://www.about.com/ 512761
5 http://www.bartleby.com/ 47841
...
Plotting the results
It is hard to understand the results just by looking at a list of numbers, so you would like to produce a graph. More specifically, you would like to produce a scatterplot, where the x-axis will show the rank of a website and the y-axis will show the size of the index page.
Luckily, you talk about your problem with your friend Aisha. She suggests that you use a program called gnuplot to produce the graph. Because she has used it many times before, Aisha helps you write the necessary gnuplot script, called produce-scatterplot.gnuplot. Note that the gnuplot file expects your experimental results to be stored in a file called results.txt.
Produce the graph with the following command:
gnuplot produce-scatterplot.gnuplot
The script should produce a file called scatterplot.eps. You can view it with evince or any other program that knows how to display an eps file (this will require X11 forwarding on klaatu).
evince scatterplot.eps
Extra credit: Find a modern list of the most popular websites and write a script parse_new.sh that performs the same URL extraction on that HTML. Produce a plot scatterplot_new.eps of this data.
Writeup
Write your answers to the following questions in your readme file:
- Examine the gnuplot file produce-scatterplot.gnuplot. Ignoring the first line, explain what the rest of the script does.
- Looking at the scatterplot, what can you conclude about the relationship between the popularity of a site and the size of its index.html file? Are these results what you expected?
Turn-in instructions
Use the turn-in drop box to submit the files you created.
Here is the list of files that you need to turn in:
- perform-measurement.sh
- parse.sh
- run-experiment.sh
- scatterplot.eps
- readme (label all answers clearly)
- (extra credit only) parse_new.sh and scatterplot_new.eps
Every file you turn in should have your name and information identifying the problem in a comment (except for the .eps file where this won't be feasible).