CSE454 Reading Assignments

Here is the recommended reading, organized by content. See each individual page for annotations describing the importance of the book or article (especially useful items are marked [Req]).

HISTORICAL PERSPECTIVE
- A historical prospective on the advancements leading to the Internet:
  NERDS - A Brief History Of The Internet, Stephen Segaller, Nov 1999.
- Brian Pinkerton's history of WebCrawler, the first Internet Search Engine. (A project in my class!).
- Web 2.0
GIANT-SCALE SERVICES
- Lessons from Giant-Scale Services by Eric Brewer, IEEE Computer, 2001. [Req]
- Web Search for a Planet: The Google Cluster Architecture, Luiz Barroso, Jeffrey Dean, and Urs Hoelzle
- MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat
- Cluster-Based Scalable Network Services by Armando Fox, Steven D. Gribble, Yatin Chawathe, Eric A. Brewer, and Paul Gauthier. Symposium on Operating Systems Principles (SOSP) 1997.
- The Google File System by Sanjay Ghemawat et al. SOSP 2003.
- Peer-to-Peer Systems
  - Stanford page on the topic. Links to presentations and papers.
  - Research paper on Berkeley's CHORD
  - A survey of Darknets
WEB CRAWLERS and SPIDERS
- A brief summary of your responsibilities when operating a crawler in this course [Req]
- The best overview of a crawler:
  Mercator: A Scalable, Extensible Web Crawler, Allan Heydon and Mark Najork, Compaq SRC, June 1999. [Req]
- This 2004 paper is probably an excellent overview on the holistic search engine (crawler, indexing plus query) process Combining Systems and Databases: A Search Engine Retrospective by Eric Brewer, co-founder of Inktomi.
- A careful paper describing a variety of architectures for building parallel crawlers. The authors propose metrics to evaluate a parallel crawler, and compare the proposed architectures using 40 million pages collected from the Web. The results clarify the relative merits of each architecture and provide a good guideline on when to adopt which architecture.
- A description of the ancestor of the crawler which we used in the 2002 project: Robert Miller's WebSphinx, implemented at CMU and originally reported in a paper in WWW7.
- What order a crawler should use when following links?
  Efficient Crawling Through URL Ordering, Junghoo Cho, Hector Garcia-Molina and Lawrence Page, Stanford University, 1998.
- Topic-specific crawling:
  Focused Crawling: A New Approach To Topic-Specific Web Resource Discovery, Soumen Chakrabarti, Martin van den Berg and Byron Dom, Elsevier Science B.V., 1999.
- Who links to who? a study of web link structure, which includes "Kevin Bacon"-style analysis.
SEARCH ENGINES, INVERTED FILES, PAGERANK
- The Dirty Little Secrets of Search by David Segal, New York Times.
- Online textbook on Information Retrieval by Manning, Raghavan, and Schutze (also available in hardcover).
- Good textbook Search Engines: Information Retrieval in Practice but it costs $73.
- Basic IR textbook
  Modern Information Retrieval, R. Baeza-Yates and B. Ribeiro-Neto, Addison Wesley, 1999.
  Covers vector space model (section 2), precision/recall (3), inverted files (8), and inverted file compression (7.4.5) Not as up to date, but half the price.
- The "Google" paper:
  The Anatomy Of A Large-Scale Hypertextual Web Search Engine, Sergey Brin and Lawrence Page, Stanford University, 1999. An extended version of the WWW-98 paper. [Req]
- Challenges in Web Search Engines , by Henzinger, Motwani, and Silverstein. ACM SIGIR Forum 36(2), 2002.
- How to implement PageRank Efficiently [Req]
- Discussion of Latent Semantic Indexing
  How LSI Works
  Visual introduction to principal components analysis (used in lsi)
- The authority and hubs model:
  Authoritative Sources in a Hyperlinked Environment, Jon Kleinberg, Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998. Extended version in Journal of the ACM 46(1999). Also appears as IBM Research Report RJ 10076, May 1997.
- On the stability of PageRank and HITS and the connection to LSI,
  Link Analysis, Eigenvectors and Stability, A. Ng, A. Zheng, and M. Jordan. IJCAI-01.
  Requires some linear algebra and math bravery, but very good.
- The "search engine"-related web site:
  Search Engine Watch, Danny Sulivan.
- A short paper on snippet generation

INFORMATION EXTRACTION

Survey on Information Extraction by Ralph Grishman
Named Entity Linking
- Good intro paper: Milne & Witten CIKM-08
- Great paper, but technical, by Kulkarni et al
- A longer paper which compares three approaches to NEL, offering ideas for how to improve them.
- A very handy resource for NEL, derived from a Google crawl of hyperlinks to Wikipedia
- A very clear paper on the Stanford Coreference System - a task which is related to NEL.
- Valentin I. Spitkovsky, and Angel X. Chang. 2012. A Cross-Lingual Dictionary for English Wikipedia Concepts. [pdf data] In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012).
TAC KBP Competition Systems
- Analysis of the 2010 KBP competition systems
- Paper describing successful approaches & challenges after the 2010 competition.
- The Stanford slot-filling system emphasizes distant supervision.
- the NYU slot-filling system, which won the 2012 competition. (Probably not as helpful a paper for 454 readers though).
Machine Learning Approaches
- The Mintz/Jurafsky paper on distant supervision - a simple approach to distant supervision.
- The MultiR system, which is much better (but a bit complicated), built here at UW. Code is available.
- Another small improvement is described by Mihai Surdeanu; by using logistic regression with regularization, the system maybe doesn't overfit as badly? Code is available for download. See also Mihai's notes on negative examples.
- ML Features for Relation Extraction
  - A 2007 paper Systematic exploration of the feature space for relation extraction by Jiang & Zhai.
  - The Mintz/Jurafsky paper on distant supervision which defines a set of features which have been widely adopted (tho they are likely not that great).
  - The Stanford SU Time System
Useful technologies
- Very clear paper on a high-performance yet simple Coreference disambigaution system.
- L. R. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, IEEE-1989.
- An Introduction to Conditional Random Fields for Relational Learning. Charles Sutton and Andrew McCallum. Book chapter in Introduction to Statistical Relational Learning. Edited by Lise Getoor and Ben Taskar. MIT Press. 2006.
- D. Freitag and A. McCallum, Information extraction with HMM structures learned by stochastic optimization AAAI-2000.
Question Answering on the Web
- The Mulder paper
- The NSIR
The UW KnowItAll Project
Etzioni, O., Cafarella, M., Downey, D., Kok, S., Popescu, A-M., Shaked, T., Soderland, S., Weld, D. and Yates, A., "Unsupervised Named-Entity Extraction from the Web: An Experimental Study" Artificial Intelligence, 165(1)91-134, 2005.

LEARNING, DATA MINING, PERSONALIZATION

Interpreting the Data: Parallel Analysis with Sawzall (Draft), Rob Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan
The best reference for machine learning (alas it is very expensive, so you might wish to go to the library or ask me to xerox pages for you):
Machine Learning, T. Mitchell, McGraw-Hill, 1997.
The SPRINT paper, which explains how to scale a decision tree learner to handle data which is much longer than memory.
Chumki Basu, Haym Hirsh, and William W. Cohen (1998).
Recommendation as Classification: Using Social and Content-Based Information in Recommendation. (AAAI98)
The original paper describing RIPPER, a fast rule learner which has proven to be a good tool for learning from unstructured text.
Two approaches to organizing web search results:
- Hierarchical Classification of Web Content (this is Sue Dumais work at Microsoft)
- A Dynamic Clustering Interface to Web Search Results, Zamir, O. and Etzioni, O., WWW-8, 1999.

USABILITY & INTERFACES

Practical guide to controlled experiments on the web: listen to your customers not to the HiPPO by Kohavi, Henne, and Sommerfield. KDD-07. [Req]
Faceted metadata for image search and browsing by Ka-Ping Yee, Kirsten Swearingen, Kevin Li and Marti Hearst. CHI-03. 2003.
AJAX:
- Wikipedia entry
- Very Dynamic Web Interfaces

ADVERTISING

Finding Advertising Keywords on Web Pages Yih, Goodman, and Carvalho. WWW-06, Scotland, 2006.
Practical guide to controlled experiments on the web: listen to your customers not to the HiPPO by Kohavi, Henne, and Sommerfield. KDD-07. [Req]

SOCIAL COMPUTING

DARPA Network Challenge Final Report DARPA, Feb 16, 2010
Labeling Images with a Computer Game von Ahn and Dabbish, CHI-04.
CityVille Explained, Part 1 and Part 2 by Tadhg Kelly, Gamasutra. Why Zynga's Cityville Gained 70M users in 4 weeks.

SECURITYu & E-COMMERCE

A great Bitcoin Primer [Req]
The original bitcoin paper - short and sweet.
A more technical analysis of the bitcoin protocol.
Kevin Fu, Emil Sit, Kendra Smith, and Nick Feamster, Dos and Don'ts of Client Authentication on the Web, MIT Tech Report 818, May 2001. [Req]
The classic (1978) paper by Rivest, Shamir and Adleman on public key cryptography: A Method for Obtaining Digital Signatures and Public-Key Cryptosystems. Still a good read.
A Cryptography FAQ by RSA Data Security. [Req]
The Black market for credit cards and identities

HAZARDS: SPAM, VIRUSES, SPYWARE and the like

The Linguistics of Link Spam
A Measurement and Analysis of Spyware in a University Environment: Stefan Saroiu, Steven D. Gribble, and Henry M. Levy. Proceedings of the First Symposium on Networked Systems Design and Implementation (NSDI), March 2004.
A Crawler-based Study of Spyware on the Web: Alexander Moschuk, Tanya Bragin, Steven D. Gribble, and Henry M. Levy
Personal data for the taking