University of Washington Department of Computer Science & Engineering
 CSE454 Reading Assignments

  • The best overview of a crawler:
    Mercator: A Scalable, Extensible Web Crawler, Allan Heydon and Marc Najork, Compaq SRC, June 1999. [Req]
  • A careful study of architectures for building parallel crawlers:
    Parallel Crawlers, Junghoo Cho and Hector Garcia-Molina. The authors propose metrics for evaluating a parallel crawler and compare the proposed architectures using 40 million pages collected from the Web. The results clarify the relative merits of each architecture and offer a good guideline on when to adopt which one.

  • A description of the ancestor of the crawler we are using in the project: Robert Miller's WebSphinx, implemented at CMU and originally reported in a paper at WWW7.

  • What order should a crawler use when following links?
    Efficient Crawling Through URL Ordering, Junghoo Cho, Hector Garcia-Molina and Lawrence Page, Stanford University, 1998.
  • Topic-specific crawling:
    Focused Crawling: A New Approach To Topic-Specific Web Resource Discovery, Soumen Chakrabarti, Martin van den Berg and Byron Dom, Elsevier Science B.V., 1999.

