A careful paper describing a variety of architectures for
crawlers. The authors propose metrics to evaluate a parallel crawler, and
compare the proposed architectures using 40 million pages collected
from the Web. The results clarify the relative merits of each
architecture and provide a good guideline on when to adopt which architecture.
A description of the ancestor of the crawler which we used in
the 2002 project: Robert
implemented at CMU and originally reported in a paper in
Basic IR textbook: Modern Information Retrieval,
R. Baeza-Yates and B. Ribeiro-Neto, Addison Wesley, 1999.
Covers vector space model (section 2), precision/recall (3), inverted
files (8), and inverted file compression (7.4.5). Not as up to date, but
half the price.
The authority and hubs model: Authoritative Sources in a Hyperlinked Environment,
Jon Kleinberg, Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998. Extended version in Journal of the ACM 46 (1999). Also appears as IBM Research Report RJ 10076, May 1997.
On the stability of PageRank and HITS and the connection to LSI: Link Analysis, Eigenvectors and Stability,
A. Ng, A. Zheng, and M. Jordan. IJCAI-01.
Requires some linear algebra and math bravery, but very good.
Valentin I. Spitkovsky and Angel X. Chang. 2012.
A Cross-Lingual Dictionary for English Wikipedia Concepts.
In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012).