CSE454 Reading Assignments

University of Washington Department of Computer Science & Engineering


The best overview of a crawler:
Mercator: A Scalable, Extensible Web Crawler, Allan Heydon and Marc Najork, Compaq SRC, June 1999. [Req]

A careful paper describing a variety of architectures for building parallel crawlers. The authors propose metrics to evaluate a parallel crawler, and compare the proposed architectures using 40 million pages collected from the Web. The results clarify the relative merits of each architecture and provide a good guideline on when to adopt which architecture.

A description of the ancestor of the crawler we are using in the project: Robert Miller's WebSphinx, implemented at CMU and originally reported in a paper at WWW7.

What order should a crawler use when following links?
Efficient Crawling Through URL Ordering, Junghoo Cho, Hector Garcia-Molina and Lawrence Page, Stanford University, 1998.

Topic-specific crawling:
Focused Crawling: A New Approach To Topic-Specific Web Resource Discovery, Soumen Chakrabarti, Martin van den Berg and Byron Dom, Elsevier Science B.V., 1999.

Department of Computer Science & Engineering
University of Washington
Box 352350
Seattle, WA 98195-2350
(206) 543-1695 voice, (206) 543-2969 FAX