Due: Friday, May 5, 2022 11:59pm
You may not drop this project assignment
In this assignment we will put our skills together. You will get more practice with shell scripting, using regular expressions, and integrating different types of processes. In particular, you will use grep and sed. You will also learn about accessing files from the web and how to incorporate a C executable and an R script into your work.
Taken in its entirety, this assignment may seem quite long. You are encouraged to read through the described approach, which is designed to break the assignment into manageable pieces. By completing and testing each piece independently, you will be successful in the overall project.
In addition to the lecture notes, you may find "The Linux Pocket Guide" a useful reference for completing this assignment.
Online manuals:
In general, whenever you need to use a new tool, you should get into the habit of looking for documentation online. There are usually good tutorials and examples that you can learn from. As you work on this assignment, if you find that you would like more information about a tool (sed, grep, or R), try searching for the name of the tool or the name of the tool followed by keywords such as "tutorial" or "documentation". Also be sure to use the regular Unix documentation (man pages and info command), and experiment with commands to try out various options and see what they do.
You realize that there are a lot of lines of code contained in all the sample code for this course. You want to get a handle on how big the normal sample C module is. In order to do this you plan to use your recently created wordcount executable to measure the size of the files. Then you have an R script that will let you look for any patterns in the file sizes.
To get started you use the course lectures web page, which you know lists all the source code files for each lecture. While you could download and analyze each file individually, you'd like to automate the process using shell scripts. You will write one script to find all the applicable source code URLs, a second script to obtain data about a single file, and then a third script that combines the first two scripts with your R analysis.
You are going to use this year's course webpage lecture list, which you know you can download using:
$ wget https://courses.cs.washington.edu/courses/cse374/23sp/calendar/lecturelist.html
To run an experiment automatically on each C file on this page, you need to extract each of the file URLs and write them into a text file.
There are several ways in which this can be done, and different utilities (sed, grep) can help. For this class, you must use grep and/or sed even if you know other programs or languages (awk, perl, python, ...) that could do similar things in different ways. But it's fine to use egrep and extended regular expressions in sed and grep if you wish.
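As a small illustration of extended regular expressions in both tools, the sketch below runs on a made-up line of HTML. The file names, the notes/ directory, and the .txt extension are hypothetical and are not taken from the course page. It is not a solution to the assignment, only a demonstration of grep -oE printing just the matching text and sed -E rewriting it with a capture group.

```shell
#!/bin/bash
# Hypothetical sample input: two links on one line of HTML.
sample='<li><a href="notes/intro.txt">intro</a> <a href="notes/week2.txt">week 2</a></li>'

# grep -o prints each match on its own line; -E enables extended regular expressions.
matches=$(printf '%s\n' "$sample" | grep -oE 'href="[^"]+\.txt"')

# sed -E uses the same extended syntax. The parentheses capture the file name,
# and \1 in the replacement keeps only the captured part.
names=$(printf '%s\n' "$matches" | sed -E 's/^href="notes\/([^"]+)"$/\1/')

printf '%s\n' "$names"
```

Because grep -o emits one match per line, a single input line with several links becomes several output lines, which is convenient when the result is piped into sed.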
In a file called geturls, write a script that extracts the URLs and writes them into a text file. The script should take two arguments: the name of the output file for results and the name of the input html file. You want to extract ONLY the URLs for C code files. (And, even if you notice a pattern for these files, you want to rely on your extraction code to retrieve them.)
For example, executing:
$ ./geturls urllist lecturelist.html
should write content similar to the following into urllist:
https://courses.cs.washington.edu/courses/cse374/23sp/lectures/ccode/hello.c
https://courses.cs.washington.edu/courses/cse374/23sp/lectures/ccode/magic.c
https://courses.cs.washington.edu/courses/cse374/23sp/lectures/ccode/printargs.c
https://courses.cs.washington.edu/courses/cse374/23sp/lectures/ccode/square1.c
https://courses.cs.washington.edu/courses/cse374/23sp/lectures/ccode/argumentdemo.c
...
If the user provides fewer than 2 arguments, the script should print an error message and exit with a return code of 1.
If the text file provided as argument (for the output) exists, the script should overwrite it with a warning, and then continue processing.
If the html file provided as argument (for the input) does not exist,
the script should print an appropriate error message and exit with a
return code of 1.
You may want to consider what this would look like if you allowed for
multiple input files, but you only need to implement the code to handle one.
If the script does not report any errors, it should exit with a return code of 0.
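The argument rules above can be sketched as a small bash function. The name check_args, the wording of the messages, and the choice to send them to standard error are all assumptions for illustration. The function uses return so it can be tried out on its own; a script would typically turn a failure into exit 1.

```shell
#!/bin/bash
# check_args is a hypothetical helper name; message text is illustrative only.
# Usage: check_args OUTPUT_FILE INPUT_FILE
check_args() {
  if [ "$#" -lt 2 ]; then
    echo "usage: geturls OUTPUT_FILE INPUT_FILE" >&2
    return 1
  fi
  local outfile="$1" infile="$2"
  if [ ! -f "$infile" ]; then
    echo "error: input file '$infile' does not exist" >&2
    return 1
  fi
  if [ -e "$outfile" ]; then
    # An existing output file is only a warning; processing continues.
    echo "warning: overwriting existing file '$outfile'" >&2
  fi
  return 0
}
```

In a script this might be used as check_args "$@" || exit 1, so that every error path leaves with a return code of 1 and the normal path falls through to the rest of the work.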
Hints: step-by-step instructions
1. Use grep to find all the lines that contain a link to a C file, perhaps in combination with the string href. You might investigate the only-matching option for grep. Test if it works (does it return the lines with the code links in them?) before proceeding to step 2. You will ultimately pipe the output of this grep search into step 2.
2. Use sed to edit the lines that are found by grep and remove any text that comes before or after the actual URL. You may also need to add some text to make the results into valid URLs. There are different ways to use sed to do this, but we request that you use a pattern to isolate the file name. (In other words, search for the file name, not any common text that comes before or after it.)
3. You may use a single sed command, or multiple commands strung together with pipes.
4. Once you have measurepage working you can double check this again: are most of the URLs you found real files?
In a file called measurepage, write a bash script that takes a URL as an argument and outputs the size of the corresponding page in words.
For example, executing your script with the URL of hello.c on the class website as argument:
$ measurepage https://courses.cs.washington.edu/courses/cse374/23sp/lectures/ccode/hello.c
should output only 31 to standard output:
31
(This number was correct at the time this assignment was prepared, but might be somewhat different if the file is modified some time in the future.)
If the user does not provide any arguments, the script should print an appropriate error message and exit with a return code of 1.
If the user provides an erroneous argument or if downloading the requested page fails for any other reason, the script should simply print the number "0" (zero). In this case, or if the page is downloaded successfully, the script should exit with a return code of 0 after printing the number to standard output.
Hints:
- The wget program downloads files from the web. Use man wget to see its options.
- The mktemp program produces unique file names for temporary files. If you create a temporary file, you should remove it before your script exits. Generally it is best to create temporary files like this in /tmp.
- Use your wordcount executable from HW3 to count the number of words in each downloaded file. You can copy the executable into the same directory as your HW4 scripts. (If you do not have a working copy of wordcount, you may use this one.) Notice that you can use this approach to include other executables you have created into shell scripts.
- Call the executable as wordcount instead of ./wordcount.
- To suppress the output of a command, redirect it to /dev/null. For example, try ls > /dev/null
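The hints about mktemp, cleanup, and /dev/null can be combined as in the sketch below. So that it runs offline, cp stands in for the download step and wc -w stands in for the wordcount executable. Both substitutions, and the function name measure_sketch, are assumptions for illustration. In a real script the download might instead look like wget -q -O "$tmp" "$url".

```shell
#!/bin/bash
# measure_sketch is a hypothetical name. cp replaces the download and
# wc -w replaces wordcount so the sketch needs no network access.
measure_sketch() {
  local src="$1" tmp count
  # Create a unique temporary file in /tmp; report 0 if that fails.
  tmp=$(mktemp /tmp/measure.XXXXXX) || { echo 0; return 0; }
  # Discard all output from the copy step; only the number is printed.
  if cp "$src" "$tmp" > /dev/null 2>&1; then
    count=$(wc -w < "$tmp")
    echo $((count))            # arithmetic expansion strips any padding from wc
  else
    echo 0                     # any failure is reported as size 0
  fi
  rm -f "$tmp"                 # always remove the temporary file
  return 0
}
```

Both branches fall through to the same rm -f and return 0, which mirrors the requirement that the exit code is 0 whether or not the page could be fetched.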
To perform the experiment, you need to execute the script measurepage on each URL inside the file urllist. Once again, you would like to do this automatically with a script.
In a file called runanalysis, write a shell script that:
- Runs measurepage on each URL in the file.
- runanalysis should produce the following output, separated by spaces:
index-number page-size
- index-number just numbers the URLs from 1-N in the order they are encountered.
- page-size is the result of measurepage.
- Before running measurepage on a URL, your script should print the following message: "Performing wordcount measurement on <URL>...".
- After measurepage produces a value, if the value is greater than zero, the script should output the following message: "...successful". If the value is zero, this means some error has occurred, and the script should output the following message: "...failure".
- When runanalysis finishes, it should exit with a return code of 0.
Since the course page is relatively small, you should be able to debug your scripts by working with the actual pages for this year. However, it will be useful to set up a test case for error checking. For example:
$ echo https://courses.cs.washington.edu/courses/cse374/23sp/lectures/ccode/magic.c > testurls
$ echo https://courses.cs.washington.edu/courses/cse374/23sp/lectures/ccode/notaurl >> testurls
$ echo https://courses.cs.washington.edu/courses/cse374/23sp/lectures/ccode/mystery.c >> testurls
$ runanalysis dataout testurls
Replacing dataout
Performing word count measurement on https://courses.cs.washington.edu/courses/cse374/23sp/lectures/ccode/magic.c...
...successful
Performing word count measurement on https://courses.cs.washington.edu/courses/cse374/23sp/lectures/ccode/notaurl...
...failure
Performing word count measurement on https://courses.cs.washington.edu/courses/cse374/23sp/lectures/ccode/mystery.c...
...successful
The content of dataout should be similar to the lines below. Note that the exact values may change over time!
$ cat dataout
1 43
2 94
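The loop that reads a URL file line by line and records results might be sketched as follows. To keep it runnable offline, fake_measure is a hypothetical stand-in for measurepage, and the function name run_sketch is also made up. The sketch assumes that failed URLs are left out of the output file and do not consume an index number. That matches the sample dataout shown above, but it is an assumption, so check it against the full requirements.

```shell
#!/bin/bash
# fake_measure is a hypothetical stand-in for measurepage: it prints 0 for
# any URL ending in "notaurl" and 42 otherwise.
fake_measure() {
  case "$1" in
    *notaurl) echo 0 ;;
    *)        echo 42 ;;
  esac
}

# Usage: run_sketch OUTPUT_FILE URL_FILE   (hypothetical name)
run_sketch() {
  local outfile="$1" urlfile="$2" url size index=0
  : > "$outfile"                       # start with an empty output file
  while IFS= read -r url; do
    [ -z "$url" ] && continue          # skip blank lines
    echo "Performing wordcount measurement on $url..."
    size=$(fake_measure "$url")
    if [ "$size" -gt 0 ]; then
      index=$((index + 1))             # assumption: only successes are numbered
      echo "$index $size" >> "$outfile"
      echo "...successful"
    else
      echo "...failure"
    fi
  done < "$urlfile"
  return 0
}
```

Reading with while IFS= read -r keeps each URL intact even if it contains backslashes or leading spaces, and redirecting the file into the loop avoids running the loop in a subshell, so index keeps its value across iterations.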
It is hard to understand the results just by looking at a list of numbers, so you would like to produce a graph. More specifically, you would like to produce a scatterplot, where the x-axis will show the URL number and the y-axis will show the size of the corresponding page.
Luckily, you've used R for some of your statistics courses, so you find a script called scatterplot.R (Right click to download scatterplot.R). Note that this script expects your experimental results to be stored in a file called dataout, so you'll need to name your output file accordingly.
You will produce a plot by calling the R script at the command line. You can make sure your system has R installed by typing R --version at your command prompt. If it is not installed, you may need to install it.
The script should produce a file called scatterplot_out.jpg. You can view this file with any image viewer.
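One common way to run an R script from the shell is the Rscript command that is installed alongside R. The sketch below wraps it with two sanity checks. The helper name make_plot is hypothetical, and using Rscript rather than another invocation is an assumption about how you choose to call the script.

```shell
#!/bin/bash
# make_plot is a hypothetical helper. Rscript is the command-line front end
# that ships with R.
# Usage: make_plot [PATH_TO_R_SCRIPT]   (defaults to scatterplot.R)
make_plot() {
  local rfile="${1:-scatterplot.R}"
  if [ ! -f "$rfile" ]; then
    echo "error: $rfile not found" >&2
    return 1
  fi
  if ! command -v Rscript > /dev/null 2>&1; then
    echo "error: Rscript not found; is R installed?" >&2
    return 1
  fi
  Rscript "$rfile"
}
```

Run it from the directory that contains dataout, since the script reads its input from a file with that name.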
You will want to submit one more file, which you create by following these steps:
- Run script hw4.script to start recording your steps.
- Run echo $USER or echo $USERNAME (depending on your system).
- Add the current directory (.) to your PATH.
- Run each of your scripts without ./ in front of its name.
- Run echo $PATH.
- Type exit to stop recording your steps.
- Your scripts should run with bash on our reference system (Seaside).
Notice that this assignment is a PROJECT, which means it is not eligible to be dropped at the end of the quarter.
Please submit your files via Gradescope. You may upload multiple individual files. You will need to submit geturls, measurepage, runanalysis, scatterplot-out.jpg, and hw4.script. We will evaluate using our own copy of wordcount.
There will be an autograder in place that will do much of the evaluation, but there will be additional manual grading to evaluate style and your analysis in the write-up. You may resubmit your homework as many times as you like up to the initial due date.