CSE 341 -- Assignment 3 -- Java Warmup #1

Due in lecture October 23, 1998

WebWorm

Web crawlers index the web by starting with some known URLs, following the links on those pages to new pages, then following the links on those new pages, and so forth, indexing as they go. The WebWorm implements a small part of this functionality.

Write and test a Java program that accepts a URL from the command line and prints out a list of all the links from that page. For example, here is a sample output from my program:

orcas% java WebWorm http://www.cs.washington.edu/education/courses/341/CurrentQtr/

Links found in http://www.cs.washington.edu/education/courses/341/CurrentQtr/
 href="mailto:borning@cs.washington.edu"
 href="http://www.cs.washington.edu/homes/hartline/"
 href="mailto:hartline@cs.washington.edu"
 href="general/overview.html"
 href="general/schedule.html"
 href="http://www.cs.washington.edu/education/syllabi/341.html"
 href="general/families.html"
 href="scheme/"
 href="java/"
 href="assignments/"
 href="http://www.cs.washington.edu/education/courses/341/98sp/index.html"
 href="http://www.cs.washington.edu/education/courses/341/98wi/index.html"
 href="http://www.cs.washington.edu/education/courses/341/97au/index.html"
For this assignment we'll define a link as anything between the string "<a" and the matching closing bracket ">". Neither delimiter is part of the link, as the sample output above shows.

Hints: a Java program to read the contents of a Web page is on page 345 of the text. It is also in ~borning/java/ReadURL.java on orcas if you want to avoid retyping it. I'd use this program as a starting point. The code includes an exception handler (the try-catch stuff). You can use this part of the program unaltered (and without worrying about completely understanding it at this point) -- we'll talk about exceptions in more detail shortly in the lectures, and they are also covered in Chapter 16 of the text.

There are some useful methods in class String for searching a string for a substring, extracting a substring, and so forth.
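
As an illustration of those String methods, here is a minimal sketch that pulls links out of a single line using indexOf and substring. The class name LinkScan and the method name linksIn are made up for this example and are not taken from the text or from ReadURL.java. When a "<a" has no closing bracket on the line, the sketch keeps the rest of the line, which is the simplification this assignment allows.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; the name is an assumption for this sketch.
class LinkScan {
    // Returns the text between each "<a" and the next ">" on one line.
    static List<String> linksIn(String line) {
        List<String> links = new ArrayList<String>();
        int start = line.indexOf("<a");            // first "<a", or -1 if none
        while (start >= 0) {
            int bodyStart = start + 2;             // skip past "<a" itself
            int end = line.indexOf('>', bodyStart);
            if (end < 0) {
                // No closing bracket on this line: keep what is there and stop.
                links.add(line.substring(bodyStart));
                break;
            }
            links.add(line.substring(bodyStart, end));
            start = line.indexOf("<a", end + 1);   // look for the next link
        }
        return links;
    }

    public static void main(String[] args) {
        String line = "<li><a href=\"scheme/\">Scheme</a> and <a href=\"java/\">Java</a>";
        for (String link : linksIn(line)) {
            System.out.println(link);
        }
    }
}
```

Note that this search is case-sensitive and purely textual, so it would miss a tag written as "<A" and would also match other tags whose names begin with "a". Whether that matters depends on the pages you test with.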

The code in ~borning/java/ReadURL.java will work both on the antique version of Java on orcas and on new versions. However, the DataInputStream readLine method is "deprecated" in Java 1.2 -- that is, there is now a better way to achieve that functionality. The method will still work but you'll get a warning from the compiler. (Browse to DataInputStream readLine in the JDK 1.2 docs for a suggestion on a different way to implement this -- but it's fine for this assignment to just use the old method.)
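
For the curious, the replacement the JDK documentation suggests is to wrap the stream in a BufferedReader and call its readLine method. The sketch below shows one way that might look. It is not the code from ReadURL.java, and the class name ReadLines and method name readLines are assumptions made for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical class; not the ReadURL program from the text.
class ReadLines {
    // Reads every line from the stream using BufferedReader.readLine,
    // the non-deprecated counterpart of DataInputStream.readLine.
    static List<String> readLines(InputStream in) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        List<String> lines = new ArrayList<String>();
        String line;
        while ((line = reader.readLine()) != null) {   // null signals end of stream
            lines.add(line);
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // The URL comes from the command line, as in the assignment.
        InputStream in = URI.create(args[0]).toURL().openStream();
        try {
            for (String line : readLines(in)) {
                System.out.println(line);
            }
        } finally {
            in.close();
        }
    }
}
```

Taking an InputStream rather than a URL keeps the reading logic easy to try out on a local file or an in-memory string before pointing it at the network.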

If you read the contents of the URL line by line (as in the ReadURL program), you don't have to deal fully with the problem of links that span multiple lines in the HTML file -- if you find the string "<a" on an input line with no matching closing bracket, then just print out as much of the link as you find on that line. (The HTML for the CS department home page has such a link.) If you are bothered by the unaesthetic nature of this solution, and want to handle this problem correctly, I'd read characters directly from the DataInputStream rather than reading a line at a time -- but you don't need to do this.
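
If you do want to try the character-at-a-time approach, the following is a rough sketch of the idea, not a required or official solution. It reads from a java.io.Reader instead of a DataInputStream (a substitution made here so the example can be run on an in-memory string), and the names CharLinkScan and links are invented for the example. A line break inside a link is replaced by a single space so that the link prints on one line.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical class; reads one character at a time so links may span lines.
class CharLinkScan {
    static List<String> links(Reader in) throws IOException {
        List<String> found = new ArrayList<String>();
        StringBuilder current = null;   // non-null only while inside a link
        int prev = -1;                  // previous character read, -1 at the start
        int c;
        while ((c = in.read()) != -1) {
            if (current != null) {
                if (c == '>') {                 // closing bracket ends the link
                    found.add(current.toString());
                    current = null;
                } else if (c == '\n') {         // join continued lines with a space
                    current.append(' ');
                } else if (c != '\r') {         // drop carriage returns
                    current.append((char) c);
                }
            } else if (prev == '<' && c == 'a') {
                current = new StringBuilder();  // just saw "<a": start a link
            }
            prev = c;
        }
        if (current != null) {
            found.add(current.toString());      // input ended inside a link
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        Reader in = new StringReader("<p><a\nhref=\"general/overview.html\">Overview</a>");
        for (String link : links(in)) {
            System.out.println(link);
        }
    }
}
```

Wrapping a network stream in a BufferedReader before passing it to a method like this avoids a separate read request for every character.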