Here is the recommended reading, organized by topic. See each individual
page for annotations describing the importance of the book or article (you
may assume that all items are optional unless marked [Req]).
- Historical Perspective
- Networking Essentials
- Giant-Scale Services
- Web Crawlers and Spiders
- A brief summary of your responsibilities when operating a crawler
in this course [Req]
- The best overview of a crawler:
Mercator: A Scalable, Extensible Web Crawler,
Allan Heydon and Marc Najork, Compaq SRC, June 1999. [Req]
- This 2004 paper is probably an excellent overview of the whole
search engine process (crawling, indexing, and querying):
Combining Systems and Databases: A Search Engine
Retrospective, by Eric Brewer, co-founder of Inktomi.
- A careful paper describing a variety of architectures for
crawlers. The authors propose metrics to evaluate a parallel crawler, and
compare the proposed architectures using 40 million pages collected
from the Web. The results clarify the relative merits of each
architecture and provide a good guideline on when to adopt which
architecture.
- A description of the ancestor of the crawler which we used in
the 2002 project: Robert
implemented at CMU and originally reported in a paper in
- What order should a crawler use when following links?
Efficient Crawling Through URL Ordering,
Junghoo Cho, Hector Garcia-Molina and Lawrence Page, Stanford University, 1998.
- Topic-specific crawling:
Focused Crawling: A New Approach To Topic-Specific Web Resource Discovery,
Soumen Chakrabarti, Martin van den Berg and Byron Dom, Elsevier Science B.V., 1999.
- Who links to whom? A study
of web link structure, which includes "Kevin Bacon"-style analysis.
- Search Engines, Inverted Files, PageRank
- The "Google" paper:
The Anatomy Of A Large-Scale Hypertextual Web Search Engine,
Sergey Brin and Lawrence Page, Stanford University, 1998. [Req]
- How to implement PageRank Efficiently
- Basic IR textbook:
Modern Information Retrieval,
R. Baeza-Yates and B. Ribeiro-Neto, Addison Wesley, 1999.
Covers the vector space model (section 2), precision/recall (3), inverted
files (8), and inverted file compression (7.4.5).
- Discussion of Latent Semantic Indexing (LSI)
- Introduction to principal components analysis (used in LSI)
- The authority and hubs model:
Authoritative Sources in a Hyperlinked Environment,
Jon Kleinberg, Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998. Extended version in Journal of the ACM 46(1999). Also appears as IBM Research Report RJ 10076, May 1997.
- On the stability of PageRank and HITS and the connection to LSI:
Link Analysis, Eigenvectors and Stability,
A. Ng, A. Zheng, and M. Jordan. IJCAI-01.
Requires some linear algebra and math bravery, but very good.
- A web site devoted to search engines:
Search Engine Watch
- A short paper on snippet generation
- Learning, data mining, personalization
- Information Extraction
- Web Services & XML protocols
overview of Web
- Examples of Web Services:
- A good introduction to data integration
- A high-level vision for the Semantic Web [Req]
- Security & E-Commerce
- Hazards: Spam, viruses, spyware and the like