Here is the recommended reading, organized by topic. See each individual
page for annotations describing the importance of the book or article (you
may assume that all items are optional unless marked [Req]).
- Historical Perspective
- Networking Essentials
- Giant-Scale Services
- Lessons from Giant-Scale
Services by Eric Brewer, IEEE Internet Computing, 2001. [Req]
- Web Search for a Planet: The Google Cluster Architecture,
Luiz Barroso, Jeffrey Dean, and Urs Hoelzle
- MapReduce: Simplified Data Processing on Large Clusters,
Jeffrey Dean and Sanjay Ghemawat
- Cluster-Based Scalable
Network Services by Armando Fox, Steven D. Gribble, Yatin
Chawathe, Eric A. Brewer, and Paul Gauthier. Symposium on Operating
Systems Principles (SOSP) 1997.
- The Google File System by
Sanjay Ghemawat et al. SOSP 2003.
- Search Engines, Inverted Files, PageRank
- The "Google" paper:
The Anatomy Of A Large-Scale Hypertextual Web Search Engine,
Sergey Brin and Lawrence Page, Stanford University, 1998. [Req]
- Challenges in Web Search Engines, by Henzinger, Motwani, and Silverstein. ACM SIGIR Forum 36(2), 2002.
- How to implement PageRank Efficiently [Req]
- Basic IR textbook
Modern Information Retrieval,
R. Baeza-Yates and B. Ribeiro-Neto, Addison Wesley, 1999.
Covers the vector space model (section 2), precision/recall (3), inverted
files (8), and inverted file compression (7.4.5).
Reading for IR Students
- Discussion of Latent Semantic Indexing
Introduction to principal components analysis (used in LSI)
- The authority and hubs model:
Authoritative Sources in a Hyperlinked Environment,
Jon Kleinberg, Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998. Extended version in Journal of the ACM 46(1999). Also appears as IBM Research Report RJ 10076, May 1997.
- On the stability of PageRank and HITS and the connection to LSI,
Link Analysis, Eigenvectors and Stability,
A. Ng, A. Zheng, and M. Jordan. IJCAI-01.
Requires some linear algebra and math bravery, but very good.
- The "search engine"-related web site:
Search Engine Watch
- A short paper on snippet generation
- Web Crawlers and Spiders
- A brief summary of your responsibilities when operating a crawler
in this course [Req]
- The best overview of a crawler:
Mercator: A Scalable, Extensible Web Crawler,
Allan Heydon and Marc Najork, Compaq SRC, June 1999. [Req]
- This 2004 paper is an excellent overview of the whole
search engine process (crawling, indexing, and query processing):
Combining Systems and Databases: A Search Engine
Retrospective by Eric Brewer, co-founder of Inktomi.
- A careful paper describing a variety of architectures for
crawlers. The authors propose metrics to evaluate a parallel crawler, and
compare the proposed architectures using 40 million pages collected
from the Web. The results clarify the relative merits of each
architecture and provide a good guideline on when to adopt which architecture.
- A description of the ancestor of the crawler which we used in
the 2002 project: Robert
implemented at CMU and originally reported in a paper in
- What order should a crawler use when following links?
Efficient Crawling Through URL Ordering,
Junghoo Cho, Hector Garcia-Molina and Lawrence Page, Stanford University, 1998.
- Topic-specific crawling:
Focused Crawling: A New Approach To Topic-Specific Web Resource Discovery,
Soumen Chakrabarti, Martin van den Berg and Byron Dom, Elsevier Science B.V., 1999.
- Who links to whom? A study
of web link structure, which includes "Kevin Bacon"-style analysis.
- Learning, data mining, personalization
- Interpreting the Data: Parallel Analysis with Sawzall (Draft),
Rob Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan
- The best reference for machine learning (alas it is very expensive, so
you might wish to go to the library or ask me to xerox pages for you):
Machine Learning, T. Mitchell, McGraw-Hill, 1997.
- The SPRINT
paper, which explains how to scale a decision tree learner to handle data
which is much larger than memory.
- Naïve Bayes and
Nearest Neighbor by Estelle Brand and Rob Gerritsen
- Chumki Basu, Haym Hirsh, and William W. Cohen (1998).
Recommendation as Classification:
Using Social and Content-Based Information in Recommendation. (AAAI98)
- The original paper describing RIPPER, a fast
rule learner which has proven to be a good tool for learning from
- Two approaches to organizing web search results:
- Visualizing weblogs by learning Markov models: short and
medium-length written versions.
- Information Extraction
- L. R. Rabiner, A Tutorial on
Hidden Markov Models and Selected Applications in Speech Recognition.
- An Introduction to
Conditional Random Fields for Relational Learning, Charles
Sutton and Andrew McCallum. Book chapter in
Introduction to Statistical Relational Learning, edited by Lise Getoor
and Ben Taskar. MIT Press, 2006.
- D. Freitag and A. McCallum, Information extraction with HMM structures
learned by stochastic optimization, AAAI-2000.
- The UW KnowItAll Project
- Etzioni, O., Cafarella, M., Downey, D., Kok, S., Popescu, A-M.,
Shaked, T., Soderland, S., Weld, D. and Yates, A., "Unsupervised
Named-Entity Extraction from the Web: An Experimental Study"
Artificial Intelligence, 165(1):91-134, 2005.
- Question Answering on the Web
- Web Services & XML protocols
overview of Web
- Examples of Web Services:
- A good introduction to data integration
- A high-level vision for the Semantic Web [Req]
- Security & E-Commerce
- Hazards: Spam, viruses, spyware and the like
- Peer-to-Peer Systems
- Stanford page on the
topic. Links to presentations and papers.
- Research paper on Berkeley's CHORD
- A survey of Darknets