University of Washington Department of Computer Science & Engineering
 CSE454 Responsibilities

Building or using a web crawler is a dangerous business. As past students (and their instructors, advisors, department chairs and university presidents) have learned all too painfully, a small design error or coding bug can cause a crawler to hammer a website with thousands of requests in a short period of time.

To a webmaster, this looks like a denial-of-service attack.

Even mild versions of this type of crawler behavior can get some touchy webmasters very upset. In the recent past it has led to email to numerous faculty, department chairs and even state officials. It has led to serious embarrassment for the advisor of at least one student. This faculty member would have preferred that the University President didn't know his name.

How do you think the professor felt towards his student? What grade do you think the student got?

Since I don't like angry emails, I insist upon extremely careful design and operation of all web software written for this course. In particular, I want you to do the following:

  • Make sure that your crawler obeys the conventions of robots.txt;
  • Ensure that your crawler refrains from hitting any one domain more than five times per minute (note that this is tricky if your spider is multithreaded);
  • Provide a way for crawled sites to contact somebody who can shut down a renegade crawler (one good way is to include email contact info in the "user agent" string);
  • Alert lab staff when your crawlers are running, and provide information that helps us shut them down quickly if we receive complaints from other sites. (A web-based mechanism for this will be detailed soon.)
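
The first three requirements above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a prescribed design: the names `USER_AGENT`, `HostRateLimiter` and `allowed_by_robots` are hypothetical, the crawler name and contact address are placeholders, and "domain" is approximated by the URL's hostname (so subdomains of one site are counted separately, which may be too lenient). The limiter holds a lock around its bookkeeping so that several threads cannot jointly exceed the limit.

```python
import threading
import time
import urllib.robotparser
from collections import defaultdict, deque
from urllib.parse import urlsplit

# Hypothetical crawler name and contact address; substitute your own.
USER_AGENT = "cse454-example-crawler/0.1 (contact: your-email@example.edu)"

MAX_HITS = 5           # at most five requests ...
WINDOW_SECONDS = 60.0  # ... per host in any 60-second window


class HostRateLimiter:
    """Thread-safe sliding-window limiter: at most max_hits per host per window."""

    def __init__(self, max_hits=MAX_HITS, window=WINDOW_SECONDS,
                 clock=time.monotonic, sleep=time.sleep):
        self._max_hits = max_hits
        self._window = window
        self._clock = clock    # injectable so the limiter can be tested
        self._sleep = sleep
        self._lock = threading.Lock()
        self._hits = defaultdict(deque)  # host -> timestamps of recent requests

    def acquire(self, url):
        """Block until a request to url's host is allowed, then record it."""
        host = urlsplit(url).hostname or ""
        while True:
            with self._lock:
                now = self._clock()
                recent = self._hits[host]
                # Drop timestamps that have aged out of the window.
                while recent and now - recent[0] >= self._window:
                    recent.popleft()
                if len(recent) < self._max_hits:
                    recent.append(now)
                    return
                wait = self._window - (now - recent[0])
            # Sleep outside the lock so threads working on other hosts proceed.
            self._sleep(wait)


def allowed_by_robots(robots_txt, url, user_agent=USER_AGENT):
    """Return True if the given robots.txt text permits fetching url."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

In a crawler built this way, every worker thread would share one `HostRateLimiter`, call `acquire(url)` immediately before each request, skip any URL for which `allowed_by_robots` returns False, and send `USER_AGENT` as the `User-Agent` header. Fetching and caching each site's robots.txt is left out of the sketch.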


Department of Computer Science & Engineering
University of Washington
Box 352350
Seattle, WA  98195-2350
(206) 543-1695 voice, (206) 543-2969 FAX