Building or using a web crawler is a dangerous business. As past students (and their instructors, advisors, department chairs, and university presidents) have learned all too painfully, a small design error or coding bug can cause a crawler to hammer a website with thousands of requests in a short period of time.

To a webmaster, this looks like a Denial of Service Attack.

Even mild versions of this type of crawler behavior can get some touchy webmasters very upset. In the recent past it has led to emails to numerous faculty members, department chairs, and even state officials. It has led to serious embarrassment for the advisor of at least one student. This faculty member would have preferred that the University President didn't know his name.

How do you think the professor felt towards his student? What grade do you think the student got?

Since I don't like angry emails, I insist upon extremely careful design and operation of all web software written for this course. In particular, I want you to follow these guidelines very carefully. In addition:

  • Ensure that your crawler refrains from hitting any one domain more than six times per minute. (Note that this is tricky if your spider is multithreaded.)
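
One possible way to respect the six-requests-per-minute limit in a multithreaded spider is to route every fetch through a single shared, lock-protected rate limiter. The sketch below is only an illustration, not required course code: the class name DomainRateLimiter, its wait method, and the injectable clock and sleep parameters are all hypothetical. It also assumes that "domain" means the URL's hostname, so www.example.com and example.com would be counted separately; tighten that if your crawler needs to.

```python
import threading
import time
from collections import defaultdict, deque
from urllib.parse import urlparse


class DomainRateLimiter:
    """Allow at most max_requests per window seconds for each hostname."""

    def __init__(self, max_requests=6, window=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests
        self.window = window
        self._clock = clock          # injectable so the limiter can be tested
        self._sleep = sleep
        self._lock = threading.Lock()
        # hostname -> timestamps of requests made within the current window
        self._history = defaultdict(deque)

    def _try_acquire(self, domain):
        """Return 0.0 if a request may go now, else the seconds to wait."""
        with self._lock:
            now = self._clock()
            stamps = self._history[domain]
            # Forget requests that have aged out of the window.
            while stamps and now - stamps[0] >= self.window:
                stamps.popleft()
            if len(stamps) < self.max_requests:
                stamps.append(now)
                return 0.0
            # Wait until the oldest recorded request leaves the window.
            return self.window - (now - stamps[0])

    def wait(self, url):
        """Block the calling thread until a request to url is permitted."""
        domain = urlparse(url).hostname or ""
        while True:
            delay = self._try_acquire(domain)
            if delay <= 0.0:
                return
            # Sleep outside the lock so threads crawling other hosts proceed.
            self._sleep(delay)
```

Each worker thread would call limiter.wait(url) on the one shared limiter object immediately before fetching url. Keeping all bookkeeping under a single lock is what stops two threads from both believing they hold the sixth slot for the same host.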

