CSE 454: Advanced Internet and Web Services
University of Washington, Department of Computer Science & Engineering

Project Part 1 - Crawler Speedup

Due: October 22, 12:00 noon. See below for hand-in details.

Overview

To get you started, we are providing a simple but functional crawler. Your assignment is to improve its performance by 1) modifying the queue structure so that multiple threads can crawl in parallel (while maintaining politeness!), 2) implementing finger-printing so that duplicate pages (reached via different URLs) are ignored, and 3) implementing checkpointing so that the crawler can resume a crawl after a crash. You'll conduct some simple experiments to demonstrate the effectiveness of your modifications, and you'll turn in a writeup explaining what you did. We grade on the quality of your code, the care taken in the experimental demonstration, and the quality of your written exposition. Since there is room for many other improvements, we will award extra credit to people who come up with additional improvements and explain them well. Despite our best efforts, the code we are handing out is likely buggy. People who find and report bugs to the class mailing list help everyone, and we will track and reward this behavior as well.

Before starting this assignment, be sure to read the Mercator paper in full. The crawler we are using is an enhanced and debugged version of Robert Miller's WebSphinx, implemented at CMU and originally reported in a paper at WWW7. Unfortunately, although WebSphinx is nicely designed, it was intended for a different purpose: mapping a small website and providing an attractive graphical display. This mapping and display functionality used a huge amount of memory and rendered WebSphinx unsuitable for large-scale crawling; previous 454 students were very unhappy with it. As a result, we have greatly modified the code, eliminating unnecessary functionality, fixing many memory problems, and making other improvements to scalability. A document describing the resulting crawler is the best place to start reading.

Politeness

The first step in building a successful web crawler is to learn proper etiquette. Virtually all content providers frown on misuse of their resources by poorly written crawlers that try to download big chunks of their sites in the space of a few seconds. Large numbers of hits from the same source can even look like a coordinated denial-of-service attack! So the foremost principle you should be thinking about in this part of the project is how to be a well-behaved net citizen. A few important guidelines for your (automated) crawlers:
  • Always provide accurate and informative information in the crawler's user-agent field during HTTP operations, and include your email address. (Put this in PoliteCrawler.java; the crawler will not run until this field is set.)
  • Space out requests to any given server and do not request the same content over and over again!
  • Never hit any given site more than a few times (e.g. 10) per minute; a sketch of one way to enforce such a delay appears after this list.
  • Avoid parts of a server that are restricted (see below.)
  • Don't request multiple copies of the same information from a server.
  • Don't request information you don't need (e.g. image files you don't process.)
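
As a concrete illustration of the spacing guidelines above, here is a minimal sketch of one way to enforce a per-host delay. It is not part of the handed-out code; the class name, the 6-second figure (roughly 10 hits per minute), and the idea of reserving the next fetch slot are all assumptions of this sketch.

import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only (not part of the handed-out crawler): remember the
// earliest permissible next fetch time for each host and sleep until it arrives.
public class HostDelay {
    private static final long MIN_DELAY_MS = 6000;              // at most ~10 hits per minute
    private final Map<String, Long> nextSlot = new HashMap<String, Long>();

    // Call before fetching a URL from the given host; blocks if the host was
    // hit too recently. The sleep happens outside the lock so requests to
    // other hosts are not held up.
    public void waitForTurn(String host) throws InterruptedException {
        long sleep;
        synchronized (nextSlot) {
            long now = System.currentTimeMillis();
            Long earliest = nextSlot.get(host);
            sleep = (earliest == null) ? 0 : earliest - now;
            // Reserve the time of this fetch so concurrent callers space out too.
            long fetchTime = Math.max(now, earliest == null ? now : earliest);
            nextSlot.put(host, fetchTime + MIN_DELAY_MS);
        }
        if (sleep > 0) Thread.sleep(sleep);
    }
}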

In addition, many content providers wish to prevent access by automated spiders to certain parts of their sites. To this end, the robots exclusion protocol was created; it allows site owners to specify which parts of the site are accessible to which web robots. A good place to start reading about this is at the Web Robots Pages. There you will find links to the exclusion standard, a list of active robots, and a FAQ. To view a quick example, the CSE department has its own robots.txt file. Another page at the robots site (slightly outdated, but still useful) is the Guidelines page. In addition, you may want to take a look at David Eichmann's Ethical Web Agents paper.
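
To make the exclusion standard concrete, here is a rough sketch of how a crawler might fetch and apply a site's robots.txt. It handles only the catch-all "User-agent: *" sections and Disallow lines (ignoring Allow lines, crawl-delay, and per-agent rules), and the class and method names are illustrative rather than part of the handed-out code.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of a robots.txt check; not the full exclusion standard.
public class RobotsCheck {
    // Returns the Disallow prefixes that apply to all agents ("User-agent: *").
    // Note: a site with no robots.txt will throw here; a real crawler should
    // treat that case as "everything allowed".
    static List<String> disallowedPaths(String host) throws Exception {
        List<String> paths = new ArrayList<String>();
        URL url = new URL("http://" + host + "/robots.txt");
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        boolean applies = false;
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim();
            if (line.toLowerCase().startsWith("user-agent:")) {
                applies = line.substring(11).trim().equals("*");
            } else if (applies && line.toLowerCase().startsWith("disallow:")) {
                String path = line.substring(9).trim();
                if (path.length() > 0) paths.add(path);
            }
        }
        in.close();
        return paths;
    }

    // A URL path is allowed if it does not start with any disallowed prefix.
    static boolean allowed(String path, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }
}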

Failure to observe these principles could have serious consequences:

  • Your account could get turned off by the CSE lab (this has happened in the past for exactly this reason: a complaint from an outside webmaster).
  • The class professor and TA could get reprimanded by the Department Chair.
  • If your account is disabled, it will be very difficult to do work, so you'll likely turn in a poor project and get a bad grade. Don't expect an extension.
  • If the professor and TA get in trouble because of your carelessness, how do you think they will grade your project when end of quarter comes?

Getting Started

Your work area is in one of:
/projects/instr/cse454-1
/projects/instr/cse454-2
Find your directory and copy /projects/instr/cse454-2/proj1 into it.
Please use the following machines to test your crawler (especially for long runs and experiments); log in as normal and then ssh into one of them from the prompt:
rockhopper
adelie
The code provides a simple and functional crawler. Get very familiar with websphinx/Crawler.java; you will have to modify (and probably add) methods in this class. There is also an API reference created by javadoc that shows the other classes you will use, most likely only at the interface level.

Compile in the directory containing PoliteCrawler.java:

javac PoliteCrawler.java
To run the crawler:
java -Xmx256m PoliteCrawler
The -Xmx256m option allows the JVM to use more memory, which may be useful later but isn't necessary to begin with. If you want to brush up on Java, there are many books, but my favorite is Arnold and Gosling's.

For this assignment, please make sure not to let your crawler touch any servers outside the cs.washington.edu domain.

Most (or all) code modifications should take place within PoliteCrawler.java and websphinx/Crawler.java (of course it may be necessary to add new source/class files).

What to Hand In

Hand in the URL of a top-level web page that lists your name and contact information as well as the information described below. Remember, you will be graded both on the quality of the artifact you have built and on the way it is described. Make sure your explanations are clear and your experimental results are presented in a way that is easy to understand. Clearly separate the answers to the parts below.

Part One: Politeness

Is the crawler polite? Explain.

To Hand In: A brief English answer (covering all aspects of politeness) and explanation. If changes are required to make the crawler polite, show them and explain.

Part Two: Focus

Modify the code so that it is possible to search from a given root URL and only visit pages in the directory containing the root or in one of its subdirectories; i.e., limit the search to URLs whose directory path begins with (extends) the root's. The current implementation starts at the given root URL (set in PoliteCrawler.java), but then only verifies that a link's host matches the root host (also set in PoliteCrawler.java). Being able to limit the search in this fashion may help you debug and understand the crawling behavior, but eventually (only when you're confident) you will want to let your crawler roam more freely within the CS domain.
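
Here is a minimal sketch of the directory-stem test described above; the class and method names are illustrative, and in practice this check would go into the link-visiting logic of websphinx/Crawler.java.

import java.net.URL;

// Illustrative sketch only: restrict the crawl to the root URL's directory subtree.
public class FocusTest {
    // True if candidate is on the same host as root and its path begins with
    // the directory containing the root URL.
    static boolean inRootDirectory(URL root, URL candidate) {
        if (!root.getHost().equalsIgnoreCase(candidate.getHost())) return false;
        String rootPath = root.getPath();
        String rootDir = rootPath.substring(0, rootPath.lastIndexOf('/') + 1);
        return candidate.getPath().startsWith(rootDir);
    }
}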

To Hand In: A listing of the code you wrote or modified for this part of the assignment. Make sure that your additions are clearly marked.

Part Three: Finger-printing

Implement finger-printing so that pages which are reached via different URLs, but which contain the same content, are ignored. Remember, you don't need to write the detailed hashing code from scratch if you can find it on the web (just be clear about where you got the code if you didn't write it yourself).
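
As an illustration of one straightforward approach, here is a minimal sketch that fingerprints page content with MD5 (via java.security.MessageDigest) and keeps the digests in a HashSet. The class name and the choice of MD5 are assumptions of this sketch, not a required design.

import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only: detect duplicate page content by hashing it.
public class Fingerprinter {
    private final Set<String> seen = new HashSet<String>();

    // Returns true the first time this exact content is seen, false for duplicates.
    // Synchronized because multiple Worms may call it concurrently.
    public synchronized boolean isNew(String pageContent) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(pageContent.getBytes("UTF-8"));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return seen.add(hex.toString());
    }
}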

To Hand In: A listing of the code you wrote, downloaded, or modified for this part of the assignment. Also include a brief description of your approach. What algorithm did you use for fingerprinting and why was it a good choice?

Part Four: Parallel Crawling

Reorganize the queue data structure so that the crawler achieves better parallelism while still ensuring politeness. Currently a single, simple queue (implemented as a Vector) is used; it takes no advantage of multithreading (each thread locks the Crawler class while it fetches a link, downloads the page, examines it, and performs any updates) and may have memory problems. Create an interface for a more sophisticated queuing system and implement a queue that takes advantage of the multiple threads. One approach is to use a separate sub-queue for each thread, with a hash function ensuring that no host is ever added to more than one queue (which facilitates politeness). Note that each thread (Worm) has a unique ID. This may suggest an interface such as:
interface BetterQueue {
    void init(int numWorms);
    void add(Link link, int wormID);
    Link remove(int wormID);
}
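
To make the idea concrete, here is a minimal sketch of one possible implementation of such an interface, partitioning links into per-Worm sub-queues by hashing the host so that any given host is always served by the same thread. It assumes the WebSphinx Link class exposes the target host via getHost(); everything else (the class name, the use of LinkedList) is illustrative, not part of the handed-out code.

import java.util.LinkedList;
import websphinx.Link;   // the Link class from the handed-out crawler

// Illustrative sketch only: per-Worm sub-queues, partitioned by host.
public class HashPartitionedQueue {
    private LinkedList<Link>[] queues;

    @SuppressWarnings("unchecked")
    public void init(int numWorms) {
        queues = new LinkedList[numWorms];
        for (int i = 0; i < numWorms; i++) {
            queues[i] = new LinkedList<Link>();
        }
    }

    // All URLs from the same host hash to the same sub-queue, no matter which
    // Worm discovered them; this keeps the politeness bookkeeping local to one thread.
    public void add(Link link, int wormID) {
        int target = (link.getHost().hashCode() & 0x7fffffff) % queues.length;
        synchronized (queues[target]) {
            queues[target].addLast(link);
        }
    }

    // Each Worm removes only from its own sub-queue; returns null if it is empty.
    public Link remove(int wormID) {
        synchronized (queues[wormID]) {
            return queues[wormID].isEmpty() ? null : queues[wormID].removeFirst();
        }
    }
}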

Here are three other issues to consider when working on this part of the project:

  • The queue may get too large to fit in memory. Using virtual memory is probably not the best way to minimize disk access, so what is a better way? See the Mercator paper.
  • You will likely need to modify the queue structure in subsequent assignments, e.g. in order to allow heuristic search or focused crawling. Make sure you write your code in a modular way so that you aren't tied to a FIFO policy.
  • Before you start this part of the project, be sure to think through your solution to Part Five, below.

To complete this part of the project, you will need to use multiple threads and synchronization. The existing code has examples of how the threads are created and how synchronization is used. Remember that data members of Crawler are shared among the Worms (threads), so some accesses to them must be synchronized. Don't fall prey to problems such as race conditions and deadlocks! Debugging these problems is nearly impossible, so the only defense is good design.

To Hand In: A listing of the code you wrote or modified for this part of the assignment. Also include a brief description of your approach. Besides an overall explanation, please include answers to the following questions: 1) what did you synchronize and why? 2) how does your design improve crawling throughput? 3) why are you sure that the code is deadlock free? 4) how is politeness maintained?

Part Five: Checkpointing

Implement checkpointing so that a crawler can be restarted and continue from the point where it left off or crashed. You'll need to figure out how to get the threads to pause until they are all in a consistent state, and then serialize all of the appropriate data structures. Provide some mechanism (such as an optional command-line argument) so that the user can set the frequency with which checkpoints occur.
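
As a rough sketch of the thread-coordination half of this, here is one way to bring all the Worms to a safe point before serializing. The class and method names are illustrative, and the actual serialization of the queue, fingerprint set, and visited set (e.g. with java.io.ObjectOutputStream) is left out.

// Illustrative sketch only: a simple barrier the Worms check into between
// pages, so a separate checkpointing thread can serialize a consistent state.
public class CheckpointBarrier {
    private boolean pauseRequested = false;
    private int paused = 0;
    private final int numWorms;

    public CheckpointBarrier(int numWorms) {
        this.numWorms = numWorms;
    }

    // Each Worm calls this between pages; it blocks while a checkpoint is in progress.
    public synchronized void checkIn() throws InterruptedException {
        if (!pauseRequested) return;
        paused++;
        notifyAll();                       // wake the checkpointing thread
        while (pauseRequested) wait();     // sleep until the checkpoint is written
        paused--;
    }

    // The checkpointing thread calls pauseAll(), serializes the crawl state,
    // then calls resumeAll().
    public synchronized void pauseAll() throws InterruptedException {
        pauseRequested = true;
        while (paused < numWorms) wait();  // wait until every Worm has checked in
    }

    public synchronized void resumeAll() {
        pauseRequested = false;
        notifyAll();
    }
}

Note that this assumes the checkpoint is written by a thread that is not itself a Worm; otherwise pauseAll() would wait forever for its own check-in.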

To Hand In: A listing of the code you wrote or modified for this part of the assignment. Also include a brief description of your approach. Besides an overall explanation, please include answers to the following questions: 1) did you have to do more synchronization? What and why? 2) if so, why are you sure that the code is deadlock free? 3) how did you test whether checkpointing worked correctly?

Part Six: Optional

List any other innovations you pursued, if any.

Part Seven: Experiments

Once you have enhanced your crawler as described above, start it on a long crawl over the CS domain, but still don't let it loose outside the CS department. Compare its performance to that of the raw crawler we gave you.

To Hand In: A set of metrics comparing the performance of the two crawlers. Present this data in tabular form: one row for each statistic in the list below, with one column for the basic crawler and another for the crawler with your extensions activated.

  • CPU time consumed
  • Number of pages visited
  • Pages crawled per minute of elapsed time
  • Min., max., and avg. links per page
  • Number of unique servers visited
  • How many of each type of web server (e.g. Apache, IIS) were encountered
  • Min., max., and avg. delay between hits to a given server
  • Number of duplicate pages found

Groups & Collaboration

You will work alone on this first part of the project. Discussions among students are allowed, subject to the Gilligan's Island rule and other important directives listed in the class policies. Directly copying another person's work (architectures, diagrams, or code) is cheating and is considered a very serious offense.

Important Note

If you get stuck or can't complete every part of the assignment, do as much as you can. Partial credit (and extra credit!) will definitely be awarded. If a bug or software glitch gets you, let us know as soon as possible (we'll give credit for finding these, and even more credit for finding a solution or workaround) - but keep working on other parts of the assignment.

