Due Date

Due: Tuesday, January 27, 2015 at 11 PM.

Objectives

The main goal of this assignment is for you to learn more about shell scripting and text-processing tools such as grep and sed. You will also learn about plotting data with gnuplot.

References

Note: The instructions for this assignment may seem a bit long. This is because we try to give you plenty of samples and hints for each part. We hope this will help you complete the assignment faster.

In addition to the lecture slides, you may find "The Linux Pocket Guide" a useful reference for completing this assignment (in particular, pages 195 and following [pp. 166 and later in the first edition]).

Online manuals

In general, whenever you need to use a new tool, you should get into the habit of looking for documentation online. There are usually good tutorials and examples that you can learn from. As you work on this assignment, if you find that you would like more information about a tool (sed, grep, or gnuplot), try searching for the name of the tool or the name of the tool followed by keywords such as "tutorial" or "documentation". Also be sure to use the regular Unix documentation (man pages and info command), and experiment with commands to try out various options and see what they do.

Assignment

Background

Because you did very well in CSE 374, you were offered a summer position in a research group. As part of your summer job, you would like to run the following experiment. Given a list of 100 popular websites, you would like to measure the sizes of their index pages (the first index.html file that the browser downloads when you visit the site). You suspect that popular sites must have very small index pages because they need to handle a heavy user load.

We provide you the list of popular websites in the file popular.html (this list was taken from 100bestwebsites.org a few days ago).

One approach would be to manually download each index page in the list, and use wc to compute its size in bytes. You could then manually put all the results in a table and plot some graphs to display the results. This approach would be long, tedious, and error prone. It would also be painful if you wanted to repeat the experiment on the 100 least popular websites.

Instead, you decide to automate the experiment by writing a set of scripts, one for each of the parts below.

Part 0. Getting started

Download the file: hw3.tar. Extract all the files for this assignment using the following command:

tar -xvf hw3.tar

You should now see a directory called hw3.

If you see it, you are ready to start the assignment. If this procedure did not work for you, please post a message on the discussion board describing the problem to see if someone else has any ideas, or talk to another student in the class.

Part 1. Downloading a page and computing its size

In a file called perform-measurement.sh, write a bash script that takes a URL as an argument and outputs the size of the corresponding page in bytes.

For example, executing your script with the URL of Homework 1 on the previous class's website as argument:

./perform-measurement.sh http://www.cs.washington.edu/education/courses/cse374/15wi/homework/hw1.html

should output only 12513:

12513

(This number was correct at the time this assignment was prepared, but might be somewhat different if the page is modified some time in the future.)

If the user does not provide any arguments, the script should print an appropriate error message and exit.

If the user provides an erroneous argument or downloading the requested page fails for any other reason, the script should simply print the number "0" (zero).

Hints:
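As a starting point, the core logic can be sketched as below. This is only one possible approach, written as a shell function so it is easy to try out interactively; the function name and the choice of curl are ours (wget would work equally well), and your error messages may differ.

```shell
#!/bin/bash
# Sketch of the logic for perform-measurement.sh, written as a function
# so it can be tested interactively. In the real script, "$1" is the
# script's own first argument.
perform_measurement() {
    if [ $# -lt 1 ]; then
        echo "usage: perform-measurement.sh URL" >&2
        return 1
    fi
    # curl -s silences progress output; -f makes curl produce no output
    # (and a nonzero exit status) when the download fails.
    local bytes
    bytes=$(curl -sf "$1" | wc -c | tr -d ' ')
    if [ "$bytes" -gt 0 ] 2>/dev/null; then
        echo "$bytes"
    else
        echo 0
    fi
}
```

Note how a failed download naturally yields zero bytes from wc -c, which collapses the "print 0 on any failure" requirement into one case.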

Part 2. Parsing the HTML list of websites

The list of popular websites is in HTML format. To run an experiment automatically on each URL in this list, we need to extract the URLs and write them into a text file. There are several ways in which this can be done, and different utilities (sed, grep) can help.

NOTE: we refer to grep throughout this document, but its friendlier variant egrep is also perfectly fine to use.

You should use grep and sed even if you know other programs (awk, perl, ...) that could be used instead.

In a file called parse.sh, write a script that extracts the URLs and writes them into a text file. The script should take two arguments: the name of the input HTML file and the name of the output file for the results.

For example, executing:

parse.sh popular.html popular.txt

should write content similar to the following into popular.txt:

http://www.yahoo.com/
http://www.google.com/
http://www.amazon.com/
...

If the user provides fewer than 2 arguments, the script should print an error message and exit.

If the HTML file provided as argument does not exist, the script should print an appropriate error message and exit.

If the text file provided as argument (for the output) exists, the script should simply overwrite it without any warning.

Q: Why might popular.txt not contain exactly 100 URLs? Is it a bug in your script or a bug in the data? You don't need to answer this question in writing; just think about it for yourself. You do, however, need to explain in a readme file, submitted with the rest of the assignment, how you chose to handle any extra URLs.

Hints:

Step-by-step instructions

Note: there are some URLs at the beginning and at the end of the file (such as http://www.100bestwebsites.org/criteria) that are not actually part of the list of 100 best web sites. It's up to you to figure out a reasonable way to handle this so they don't get included in the output - either by removing them somehow (by hand? with some sort of script?), or leaving them in and figuring out how best to deal with them. You should explain what you did in a readme file that you turn in with your code. This shouldn't be extensive or long-winded, just a sentence or two about what the problem was and how you dealt with it.
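One plausible shape for the extraction step is sketched below, again as a function for easy testing. The grep pattern assumes the URLs appear in double-quoted href attributes, which you should verify against popular.html; handling the extra non-list URLs is still up to you.

```shell
#!/bin/bash
# Sketch of parse.sh. Checks the two error cases, then pipes grep
# output through sed to strip the href="..." wrapper.
parse_urls() {
    if [ $# -lt 2 ]; then
        echo "usage: parse.sh input.html output.txt" >&2
        return 1
    fi
    if [ ! -f "$1" ]; then
        echo "parse.sh: input file '$1' does not exist" >&2
        return 1
    fi
    # grep -o prints only the matching part of each line, one match
    # per line; sed then removes the leading href=" and trailing ".
    grep -o 'href="http[^"]*"' "$1" | sed 's/^href="//; s/"$//' > "$2"
}
```

Because the output redirection simply truncates "$2", an existing output file is overwritten without warning, as required.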

Part 3. Running the experiment

To perform the experiment, you need to execute the script perform-measurement.sh on each URL inside the file popular.txt. Once again, you would like to do this automatically with a script.

In a file called run-experiment.sh, write a shell script that:

To debug your script, instead of trying it directly on popular.txt, we provide you with a smaller file: popular-small.txt. You should execute your script on popular-small.txt until it works. Only then try it on popular.txt.

Executing your script as follows:

run-experiment.sh popular-small.txt results-small.txt

should produce output similar to the following:

Performing measurement on http://courses.cs.washington.edu/courses/cse374/13wi/
...success
Performing measurement on http://i.will.return.an.error
...failed
Performing measurement on http://courses.cs.washington.edu/courses/cse374/13wi/syllabus.html
...success

And the content of results-small.txt should be similar to the ones below. Note that the exact values may have changed if the old 13wi website was edited!

1 http://www.cs.washington.edu/education/courses/cse374/13wi/ 8552
3 http://www.cs.washington.edu/education/courses/cse374/13wi/syllabus.html 14078

As another example, after executing your script as follows:

run-experiment.sh popular.txt results.txt

the file results.txt should contain results somewhat like those shown below (the exact values will likely differ when you run your experiment):

...
3 http://www.amazon.com/ 412333
4 http://www.about.com/ 634459
5 http://www.bartleby.com/ 47505
...
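The driver loop might be sketched as follows (as a function; argument checks like those in Part 2 are omitted here and left to you). It assumes perform-measurement.sh from Part 1 is executable in the current directory.

```shell
#!/bin/bash
# Sketch of run-experiment.sh: read URLs one per line, measure each,
# and record "rank URL size" for the successes.
run_experiment() {
    local input=$1 output=$2
    local rank=0 url size
    : > "$output"              # overwrite any previous results file
    while read -r url; do
        rank=$((rank + 1))
        echo "Performing measurement on $url"
        size=$(./perform-measurement.sh "$url")
        if [ "$size" != "0" ]; then
            echo "...success"
            echo "$rank $url $size" >> "$output"
        else
            echo "...failed"
        fi
    done < "$input"
}
```

Note that the rank counter advances even when a measurement fails, which is why the sample results above skip line 2.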

Part 4. Plotting the results

It is hard to understand the results just by looking at a list of numbers, so you would like to produce a graph. More specifically, you would like to produce a scatterplot, where the x-axis will show the rank of a website and the y-axis will show the size of the index page.

Luckily, you mention your problem to your friend Alice. She suggests using a program called gnuplot to produce the graph. Because she has used it many times before, Alice helps you write the necessary gnuplot script, produce-scatterplot.gnuplot (look in the hw3 directory created by extracting hw3.tar). Note that the gnuplot file expects your experimental results to be stored in a file called results.txt.

Produce the graph with the following command:

gnuplot produce-scatterplot.gnuplot

The script should produce a file called scatterplot.eps. You can view it with evince or any other program that knows how to display an EPS (Encapsulated PostScript) file.

evince scatterplot.eps
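For orientation, a minimal gnuplot scatterplot script generally has the shape below. This is only an illustration of gnuplot syntax, not the contents of the provided produce-scatterplot.gnuplot; the column numbers assume the three-column results.txt format from Part 3.

```gnuplot
set terminal postscript eps       # write Encapsulated PostScript output
set output "scatterplot.eps"      # name of the file to produce
set xlabel "Site rank"
set ylabel "Index page size (bytes)"
plot "results.txt" using 1:3 with points   # x = rank (col 1), y = size (col 3)
```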

If you are working on klaatu or some other remote machine, you can either transfer the .eps file to your local machine and view it there, or you can see it by running the viewer program remotely with X11 forwarding. In the latter case you may need to use the -X option on ssh (ssh -X klaatu.cs....) or the equivalent on your remote login application (PuTTY, for example). This sets up the connection so the remote viewer program can open a window on your local machine to display the results.

If you are using the CSE Fedora VM and evince is not installed, use Fedora's software installation program to add it.

Write your answers to the following questions in a file called problem4.txt:

Q1: Examine the gnuplot file produce-scatterplot.gnuplot. Ignoring the first line, explain what the rest of the script does.

Q2: Looking at the scatterplot, what can you conclude about the relationship between the popularity of a site and the size of its index.html file? Are these results what you expected?

Assessment

Your solutions should be

Identifying information including

should appear as comments at the top of each of your text files (this is not feasible for the .eps file).

Turn-in Instructions

Use the turn-in drop box link on the main course web page to submit your files:

If you do combine your files into an archive, use only tar or zip (we want to be able to read what you turn in).