- The best overview of a crawler:
Mercator: A Scalable, Extensible Web Crawler,
Allan Heydon and Marc Najork, Compaq SRC, June 1999. [Req]
- A careful paper describing a variety of architectures for building
parallel crawlers. The authors propose metrics to evaluate a parallel
crawler, and
compare the proposed architectures using 40 million pages collected
from the Web. The results clarify the relative merits of each
architecture and provide a good guideline on when to adopt which
architecture.
- A description of the ancestor of the crawler we are using in
the project: Robert Miller's WebSphinx, implemented at CMU and
originally reported in a paper at WWW7.
- What order should a crawler use when following links?
Efficient Crawling Through URL Ordering,
Junghoo Cho, Hector Garcia-Molina and Lawrence Page, Stanford University, 1998.
- Topic-specific crawling:
Focused Crawling: A New Approach To Topic-Specific Web Resource Discovery,
Soumen Chakrabarti, Martin van den Berg and Byron Dom, Elsevier Science B.V., 1999.