This assignment uses grep and sed for text processing, and gnuplot for plotting files.

Note: The instructions for this assignment may seem a bit long. This is because we try to give you plenty of samples and hints for each part. We hope this will help you complete the assignment faster.

In addition to the lecture slides, you may find "The Linux Pocket Guide" a useful reference for completing this assignment (in particular, pages 195 and following [pp. 166 and later in the first edition]).
If you need more information about a tool (sed, grep, or gnuplot), try searching for the name of the tool, or the name of the tool followed by keywords such as "tutorial" or "documentation". Also be sure to use the regular Unix documentation (man pages and the info command), and experiment with commands to try out various options and see what they do.
You would like to measure the size of the index page of a number of websites (the index.html file that the browser downloads when you visit the site). You suspect that popular sites must have very small index pages because they need to handle a heavy user load.

We provide you the list of popular websites in the file popular.html (this list was taken from 100bestwebsites.org a few days ago).
One approach would be to manually download each index page in
the list, and use wc
to compute its
size in bytes. You could then manually put all the results in a
table and plot some graphs to display the results. This approach
would be long, tedious, and error prone. It would also be painful
if you wanted to repeat the experiment on the 100 least popular
websites.
Instead, you decide to automate the experiment by writing a set of scripts:
Download the file hw3.tar. Extract all the files for this assignment using the following command:

tar -xvf hw3.tar

You should now see a directory called hw3.
If you see it, you are ready to start the assignment. If this
procedure did not work for you, please post a message on the
discussion board describing the problem to see if someone else has any
ideas, or send email to cse374-staff[at]cs
, or talk to another
student in the class.
In a file called perform-measurement.sh
, write a bash
script that
takes a URL as an argument and outputs the size of the
corresponding page in bytes.
For example, executing your script with the URL of Homework 1 on the previous class's website as argument:
./perform-measurement.sh
http://www.cs.washington.edu/education/courses/cse374/13wi/homework/hw1.html
should output only 11920:
11920
(This number was correct at the time this assignment was prepared, but might be somewhat different if the page is modified some time in the future.)
If the user does not provide any arguments, the script should print an appropriate error message and exit.
If the user provides an erroneous argument or downloading the requested page fails for any other reason, the script should simply print the number "0" (zero).
Hints:

- Don't forget to set execute permission on perform-measurement.sh to make it executable.
- The wget program downloads files from the web. Use man wget to see its options.
- The mktemp program produces unique file names for temporary files. If you create a temporary file, you should remove it before your script exits. Generally it is best to create temporary files like this in directory /tmp.
- Compare the output of wc a-test-file and wc < a-test-file; the second form prints only the counts, without the file name.
- Output you don't want to see can be discarded by redirecting it to /dev/null. For example, try ls > /dev/null.
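Putting these hints together, here is one possible sketch of the script's core logic, written as a shell function for readability. The function name, option choices, and temp-file template are our own assumptions, not part of the assignment:

```shell
#!/bin/bash
# Hypothetical sketch of the core of perform-measurement.sh; the real
# script would run this logic on its first argument ($1).

perform_measurement() {
    local url="$1"
    if [ -z "$url" ]; then
        echo "usage: perform-measurement.sh URL" >&2
        return 1
    fi

    # mktemp creates a uniquely named temporary file in /tmp.
    local tmpfile
    tmpfile=$(mktemp /tmp/measurement.XXXXXX) || return 1

    if wget -q -O "$tmpfile" "$url"; then
        # 'wc -c < file' prints only the byte count, with no file name.
        wc -c < "$tmpfile"
    else
        # Any download failure is reported as size 0, per the spec.
        echo 0
    fi

    # Remove the temporary file before exiting.
    rm -f "$tmpfile"
}
```

Remember that the standalone script must also be marked executable with chmod before you can run it as ./perform-measurement.sh.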
The list of popular websites is in HTML format. To run an
experiment automatically on each URL in this list, we need to
extract the URLs and write them into a text file. There are
several ways in which this can be done, and different utilities
(sed, grep) can help. You should use grep and sed even if you know other programs (awk, perl, ...) that could be used instead.
In a file called parse.sh
, write a
script that extracts the URLs and writes them into a text file.
The script should take two arguments: the name of the input HTML
file and the name of the output file for the results.
For example, executing:
parse.sh popular.html popular.txt
should write content similar to the following into
popular.txt
:
http://www.yahoo.com/
http://www.google.com/
http://www.amazon.com/
...
If the user provides fewer than 2 arguments, the script should print an error message and exit.
If the HTML file provided as argument does not exist, the script should print an appropriate error message and exit.
If the text file provided as argument (for the output) exists, the script should simply overwrite it without any warning.
Q: How come popular.txt
might not contain exactly
100 URLs? Is it a bug in your script or a bug in the data? You
don't need to answer this question in writing, just think about
it for yourself. You do need to explain in a
readme
file submitted with the rest of the
assignment how you chose to handle any extra URLs.
Hints: build the script incrementally, testing each step.

1. Use grep to find all the lines that contain the string http. Test if it works before proceeding to step 2.
2. Use sed to replace everything that precedes the URL with the empty string. Test if it works before proceeding to step 3. Your sed command(s) must match the http://... URL strings themselves, not surrounding text in the table. (That is, your sed command must use a pattern that matches the URLs although, of course, it may contain more than that if needed to isolate the URL strings. But it can't just be .* surrounded by patterns that match whatever appears before and after the URLs in this particular data file.)
3. Use sed to replace everything after the URL with the empty string as well. Test if everything works.

Note: there are some URLs at the beginning and at the end of
the file (such as http://www.100bestwebsites.org/criteria
) that are
not actually part of the list of 100 best web sites. It's up to
you to figure out a reasonable way to handle this so they don't
get included in the output - either by removing them somehow (by
hand? with some sort of script?), or leaving them in and figuring
out how best to deal with them. You should explain what you did
in a readme
file that you turn in with your code.
This shouldn't be extensive or long-winded, just a sentence or
two about what the problem was and how you dealt with it.
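As a rough illustration of how the steps can fit together, here is a hedged sketch written as a shell function. The function name and the exact sed pattern are our own guesses; the right pattern depends on how the URLs actually appear in popular.html, and this sketch does not yet handle the extra URLs discussed in the note above:

```shell
#!/bin/bash
# Hypothetical sketch of the core of parse.sh; adapt the sed pattern to
# the real HTML in popular.html.

parse_urls() {
    local infile="$1" outfile="$2"
    if [ -z "$infile" ] || [ -z "$outfile" ]; then
        echo "usage: parse.sh input.html output.txt" >&2
        return 1
    fi
    if [ ! -f "$infile" ]; then
        echo "parse.sh: no such file: $infile" >&2
        return 1
    fi
    # Step 1: keep only lines containing http.
    # Steps 2 and 3: delete the text before and after the URL itself.
    # This pattern assumes each URL ends at a quote, space, or '<'.
    grep 'http' "$infile" \
        | sed 's/.*\(http:\/\/[^" <]*\).*/\1/' \
        > "$outfile"    # silently overwrites any existing output file
}
```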
To perform the experiment, you need to execute the script
perform-measurement.sh
on each URL
inside the file popular.txt
. Once
again, you would like to do this automatically with a script.
In a file called run-experiment.sh, write a shell script that executes perform-measurement.sh on each URL in the input file. For each URL, run-experiment.sh should produce the following output, separated by spaces:
rank URL page-size
The rank of a page is
the line number of the corresponding URL in popular.txt
(or whatever the input file
containing the URLs is named). The URL on the first line of
the table has rank 1, the URL on the second line has rank 2,
and so on until the last URL/line in the file. The
URL is the same string as the
argument you gave to perform-measurement.sh
.
The page-size is the result of perform-measurement.sh
.
- run-experiment.sh should write its output to a file. The name of that file should be given by the user as the second argument.
- If perform-measurement.sh returns zero for a URL, run-experiment.sh should not write any output to the file for that URL.
- Before running perform-measurement.sh on a URL, your script should print the following message: "Performing measurement on <URL>...".
- After perform-measurement.sh produces a value, if the value is greater than zero, the script should output the following message: "...success". If the value is zero, this means some error has occurred, and the script should output the following message: "...failed".
- To debug your script, instead of trying it directly on popular.txt, we provide you with a smaller file: popular-small.txt. You should execute your script on popular-small.txt until it works. Only then try it on popular.txt.
Executing your script as follows:
run-experiment.sh popular-small.txt results-small.txt
should produce output similar to the following:
Performing measurement on http://www.cs.washington.edu/education/courses/cse374/13wi/...
...success
Performing measurement on http://i.will.return.an.error...
...failed
Performing measurement on http://www.cs.washington.edu/education/courses/cse374/13wi/syllabus.html...
...success
And the content of results-small.txt
should be similar to the ones below. Note that the exact
values may have changed if the previous class's website was edited!
1 http://www.cs.washington.edu/education/courses/cse374/13wi/ 8552
3 http://www.cs.washington.edu/education/courses/cse374/13wi/syllabus.html 14078
As another example, after executing your script as follows:

run-experiment.sh popular.txt results.txt

the file results.txt should contain results somewhat like the ones shown below (when you run your experiment, the exact values will likely differ):
...
3 http://www.amazon.com/ 157147
4 http://www.about.com/ 96537
5 http://www.bartleby.com/ 41402
...
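One way to combine the pieces above is sketched below as a shell function (a sketch under the stated spec; it assumes perform-measurement.sh is in the current directory, and the variable names are our own). Note that the rank counts every line of the input file, including URLs whose measurement fails, which matches the sample above where rank 2 is absent from the results:

```shell
#!/bin/bash
# Hypothetical sketch of the core of run-experiment.sh.

run_experiment() {
    local infile="$1" outfile="$2"
    if [ $# -lt 2 ]; then
        echo "usage: run-experiment.sh urls.txt results.txt" >&2
        return 1
    fi
    : > "$outfile"    # start with an empty results file
    local rank=0 url size
    while read -r url; do
        rank=$((rank + 1))
        echo "Performing measurement on $url..."
        size=$(./perform-measurement.sh "$url")
        if [ "$size" -gt 0 ] 2>/dev/null; then
            echo "...success"
            echo "$rank $url $size" >> "$outfile"
        else
            # A size of 0 means the download failed; write nothing.
            echo "...failed"
        fi
    done < "$infile"
}
```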
It is hard to understand the results just by looking at a list of numbers, so you would like to produce a graph. More specifically, you would like to produce a scatterplot, where the x-axis will show the rank of a website and the y-axis will show the size of the index page.
Luckily, you talk about your problem to your friend Alice. She
suggests that you use a program called gnuplot
to produce the graph. Because she used it
many times before, Alice helps you write the necessary gnuplot
script called produce-scatterplot.gnuplot
(look in the hw3
directory created by extracting the hw3.tar
file). Note that the gnuplot
file expects your experimental results to be stored in a file
called results.txt
.
Produce the graph with the following command:
gnuplot produce-scatterplot.gnuplot
The script should produce a file called scatterplot.eps
. You can view it with
evince
or any other program that knows how to
display an EPS (Encapsulated PostScript) file.
evince scatterplot.eps
If you are working on klaatu
or some other remote
machine, you can either transfer the .eps
file to your local
machine and view it there, or you can see it by running the
viewer program remotely. In the latter case you may need to use
the -X
option on ssh
(ssh -X klaatu.cs....
) or the equivalent on
your remote login application (PuTTY, for example). This sets up
the connection so the remote viewer program can open a window on
your local machine to display the results.
If you are using the CSE Fedora VM and evince
is
not installed, use Fedora's software installation program to add
it.
Write your answers to the following questions in a file called
problem4.txt
:
Q1: Examine the gnuplot
file produce-scatterplot.gnuplot
. Ignoring the first
line, explain what the rest of the script does.
Q2: Looking at the scatterplot, what can you conclude about
the relationship between the popularity of a site and the size of
its index.html
file? Are these results what you expected?
Your scripts must run with bash on our reference systems (klaatu and the CSE virtual machine). Identifying information, including your name, should appear at the top of each file you turn in (except the generated .eps file).
Use the turn-in drop box link on the main course web page to submit your files:
perform-measurement.sh
parse.sh
run-experiment.sh
problem4.txt
and scatterplot.eps
readme file containing your (brief) notes about how you dealt with extraneous URLs or other problems in the input data.

If you do combine your files into
an archive, use some straightforward format like tar
or zip
and
don't use exotic compression formats. (We want to be able to
unscramble what you turn in without too much guessing.)