CSE341 -- Assignments -- Java warmup

Due 26 April 2000.

This is a practice assignment to give you some experience with Java.

Submission directions: Please turn in both a printed and an electronic copy of your code. The printed copy is due in class on the due date above. You may e-mail the electronic copy to Keunwoo (klee@cs) any time that same day. For full credit, you must follow the submission guidelines.

Overview

Web crawlers (or "spiders") are programs that index the web by starting with some known URLs, finding the pages those pages point to, then the pages those point to, and so forth, indexing as they go.

Write and test a Java program that accepts a URL and a number n on the command line, then follows n random links starting from the given URL, printing each URL it visits. For example, here is a sample invocation and its output:

orcas% java WebWorm http://www.cs.washington.edu/education/courses/341/CurrentQtr/ 5

Starting crawl from http://www.cs.washington.edu/education/courses/341/CurrentQtr/
 href="http://www.cs.washington.edu/homes/borning/"
 href="http://www.cs.washington.edu/education/courses/510/99wi/"
 href="http://www.uwtc.washington.edu/courses_and_programs/courses/tc_517/Default.htm"
 href="http://www.uwtc.washington.edu/"
 href="http://www.uwtc.washington.edu/other_resources/Default.htm"
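
Here is one possible skeleton for the command-line side of the program, assuming the class is named WebWorm as in the invocation above; the crawl method is a hypothetical placeholder for the loop described later in this handout:

    // Hypothetical skeleton: parse the two arguments and hand off
    // to a crawl method that you write yourself.
    public class WebWorm {
        public static void main(String[] args) {
            if (args.length != 2) {
                System.err.println("usage: java WebWorm <start-url> <n>");
                System.exit(1);
            }
            String startUrl = args[0];
            int n = Integer.parseInt(args[1]);  // number of links to follow
            System.out.println("Starting crawl from " + startUrl);
            // crawl(startUrl, n);  // hypothetical; see the loop sketch below
        }
    }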

If you are unfamiliar with HTML, the UW has a list of resources for learning HTML and the web that you might find useful.

Details and definitions

For the purposes of this assignment, a link is a string of text that begins with href=, ends with the next > symbol, and appears entirely on a single line.

The target of a link is the part between the href= and the closing > symbol. The target may optionally be enclosed in quotes; strip these before attempting to follow it. Targets may be absolute (a full URL, e.g. "http://www.cs.washington.edu") or relative (a path interpreted relative to the current page, e.g. "../perl/index.html" and "java/example.html"). Your crawler must handle both cases. You do not need to parse non-HTML targets.
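
One convenient way (though not the only one) to handle both cases is the two-argument java.net.URL constructor, which resolves a possibly-relative spec against a base URL. A minimal demonstration, using the example targets above:

    import java.net.MalformedURLException;
    import java.net.URL;

    public class ResolveDemo {
        public static void main(String[] args) throws MalformedURLException {
            URL base = new URL("http://www.cs.washington.edu/education/courses/341/CurrentQtr/");
            // An absolute target passes through unchanged:
            System.out.println(new URL(base, "http://www.cs.washington.edu"));
            // Relative targets are resolved against the base page:
            System.out.println(new URL(base, "../perl/index.html"));
            // prints http://www.cs.washington.edu/education/courses/341/perl/index.html
            System.out.println(new URL(base, "java/example.html"));
            // prints http://www.cs.washington.edu/education/courses/341/CurrentQtr/java/example.html
        }
    }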

Your program should consist essentially of a loop that fetches the page at the current URL, extracts and parses all of its links (resolving relative URLs), selects one at random, and then fetches that page.

You do not have to handle links that span multiple lines. Also, if your crawler reaches a dead end (a page with no links), it should tell the user so and exit. A sketch of the whole loop appears below.
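
Here is one rough sketch of such a loop, including the dead-end check. The extractTargets helper is hypothetical: it stands in for whatever parsing you write for the link definition above, and this simple version makes no attempt to be robust against unusual HTML.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    public class CrawlSketch {
        public static void main(String[] args) throws Exception {
            URL current = new URL(args[0]);
            int n = Integer.parseInt(args[1]);
            Random rng = new Random();
            System.out.println("Starting crawl from " + current);
            for (int i = 0; i < n; i++) {
                List<String> targets = extractTargets(current);
                if (targets.isEmpty()) {
                    System.out.println("Dead end at " + current + "; stopping.");
                    return;
                }
                String pick = targets.get(rng.nextInt(targets.size()));
                current = new URL(current, pick);  // resolves relative targets too
                System.out.println(" href=\"" + current + "\"");
            }
        }

        // Hypothetical helper: scan each line for href= targets, stripping
        // optional quotes. Links never span lines in this assignment.
        static List<String> extractTargets(URL page) throws Exception {
            List<String> out = new ArrayList<String>();
            BufferedReader in = new BufferedReader(new InputStreamReader(page.openStream()));
            String line;
            while ((line = in.readLine()) != null) {
                int at = line.indexOf("href=");
                while (at >= 0) {
                    int end = line.indexOf('>', at);
                    if (end < 0) break;
                    String t = line.substring(at + 5, end).trim();
                    if (t.startsWith("\"")) t = t.substring(1);
                    if (t.endsWith("\"")) t = t.substring(0, t.length() - 1);
                    out.add(t);
                    at = line.indexOf("href=", end);
                }
            }
            in.close();
            return out;
        }
    }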

For extra credit, you can adapt your crawler to do something reasonable when it encounters circular chains of URL references.
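
One plausible approach, among others: keep every URL visited so far in a java.util.HashSet and treat a repeat as a cycle. A small standalone demonstration (the URLs are made up for illustration):

    import java.util.HashSet;
    import java.util.Set;

    public class VisitedDemo {
        public static void main(String[] args) {
            Set<String> visited = new HashSet<String>();
            String[] walk = { "http://a.example/", "http://b.example/", "http://a.example/" };
            for (String url : walk) {
                if (!visited.add(url)) {  // add() returns false if already present
                    System.out.println("Cycle detected at " + url);
                    break;
                }
                System.out.println(" href=\"" + url + "\"");
            }
        }
    }

In the crawler itself you might instead remove already-visited targets from the candidate list before picking one at random, so that a single repeated link does not end the crawl prematurely.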


Last modified: Thu Apr 20 12:18:15 PDT 2000