CSE454 Reading Assignments

University of Washington Department of Computer Science & Engineering

Here is the recommended reading, organized by topic. See each individual page for annotations describing the importance of the book or article (assume that all items are optional unless marked [Req]).

Historical Perspective

A historical perspective on the advances leading to the Internet:
NERDS - A Brief History Of The Internet, Stephen Segaller, Nov 1999.

Brian Pinkerton's history of WebCrawler, the first Internet search engine. (A project in my class!)

Networking Essentials

A comprehensive book on networking (and expensive too):
Computer Networks - A Systems Approach, Larry Peterson and Bruce Davie, 2nd edition, 1999.

Informal introduction to the Domain Name System (a.k.a. DNS):
Exploring the Domain Name Space; DNS in Action, Kristin Windbigler, Jan 1997.

Informal introduction to HTTP, or "What happens when you click on a link?":
HTTP Transactions and You; One Click = Several Requests, Dean Gaudet, Feb 1997.

A comprehensive guide on how to write HTTP clients and servers:
HTTP Made Really Easy: A Practical Guide To Writing Clients And Servers, James Marshall, Apr 1997. [Req]
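
As a rough companion to the HTTP readings above, here is a minimal sketch of what a client puts on the wire for a single GET and how it might read the first line of the reply. The helper names build_get_request and parse_status_line are hypothetical, chosen for illustration only, and the sketch assumes a plain HTTP/1.0 exchange with no request body.

```python
def build_get_request(host, path="/"):
    """Return the bytes of a minimal HTTP/1.0 GET request (illustrative helper)."""
    lines = [
        "GET {} HTTP/1.0".format(path),  # request line: method, path, version
        "Host: {}".format(host),         # optional in HTTP/1.0, required in HTTP/1.1
        "",                              # blank line ends the header section
        "",
    ]
    # HTTP separates lines with CRLF, so the request ends with CRLF CRLF.
    return "\r\n".join(lines).encode("ascii")


def parse_status_line(line):
    """Split a response status line into (version, status code, reason phrase)."""
    version, code, reason = line.rstrip("\r\n").split(" ", 2)
    return version, int(code), reason


request = build_get_request("example.com", "/index.html")
status = parse_status_line("HTTP/1.0 404 Not Found\r\n")
```

A page with images or stylesheets would trigger one such request per resource, which is the point of the "One Click = Several Requests" article.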

Giant-Scale Services

Lessons from Giant-Scale Services by Eric Brewer, IEEE Computer, 2001.
Cluster-Based Scalable Network Services by Armando Fox, Steven D. Gribble, Yatin Chawathe, Eric A. Brewer, and Paul Gauthier. Symposium on Operating Systems Principles (SOSP) 1997.
The Google File System by Sanjay Ghemawat et al. SOSP 2003.

Web Crawlers and Spiders

A brief summary of your responsibilities when operating a crawler in this course [Req]
The best overview of a crawler:
Mercator: A Scalable, Extensible Web Crawler, Allan Heydon and Marc Najork, Compaq SRC, June 1999. [Req]
An excellent 2004 overview of the whole search engine process (crawling, indexing, and query processing): Combining Systems and Databases: A Search Engine Retrospective, by Eric Brewer, co-founder of Inktomi.

A careful paper describing a variety of architectures for building parallel crawlers. The authors propose metrics to evaluate a parallel crawler, and compare the proposed architectures using 40 million pages collected from the Web. The results clarify the relative merits of each architecture and provide a good guideline on when to adopt which architecture.
A description of the ancestor of the crawler which we used in the 2002 project: Robert Miller's WebSphinx, implemented at CMU and originally reported in a paper in WWW7.
What order should a crawler use when following links?
Efficient Crawling Through URL Ordering, Junghoo Cho, Hector Garcia-Molina and Lawrence Page, Stanford University, 1998.

Topic-specific crawling:
Focused Crawling: A New Approach To Topic-Specific Web Resource Discovery, Soumen Chakrabarti, Martin van den Berg and Byron Dom, Elsevier Science B.V., 1999.
Who links to whom? A study of web link structure, which includes "Kevin Bacon"-style analysis.

Search Engines, Inverted Files, PageRank

The "Google" paper:
The Anatomy Of A Large-Scale Hypertextual Web Search Engine, Sergey Brin and Lawrence Page, Stanford University, 1999. [Req]
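
To give a feel for the PageRank idea before reading the paper, here is a minimal power-iteration sketch on a tiny made-up link graph. It is an illustration under stated assumptions, not the paper's implementation: it uses the common normalized variant in which ranks sum to 1, a damping factor of 0.85, a fixed number of iterations, and it spreads the rank of pages without outlinks evenly over all pages. The function name and the example graph are hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to; every link
    target is assumed to appear as a key as well.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its damped rank evenly among its outlinks.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outlinks spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: a links to b and c, b links to c, c links to a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

In this toy graph, page c collects rank from both a and b, so it ends up ranked above b, which is only linked from a.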

How to implement PageRank Efficiently
Basic IR textbook:
Modern Information Retrieval, R. Baeza-Yates and B. Ribeiro-Neto, Addison Wesley, 1999.
Covers the vector space model (section 2), precision/recall (section 3), inverted files (section 8), and inverted file compression (section 7.4.5).
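
For readers who have not met inverted files before, the following is a minimal in-memory sketch of the idea: map each term to the set of documents containing it, then answer a conjunctive query by intersecting those sets. It assumes whitespace tokenization and lowercasing only, with no stemming, ranking, or compression; the function names and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenization
            index[term].add(doc_id)
    return index


def search_all(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Look up each term's posting set; unknown terms yield an empty set.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical three-document collection.
docs = {
    1: "web crawler design",
    2: "web search engine",
    3: "search engine crawler",
}
index = build_inverted_index(docs)
hits = search_all(index, "search engine")
```

A production index would store sorted, compressed posting lists on disk rather than Python sets, which is the subject of the textbook sections listed above.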

Discussion of Latent Semantic Indexing
How LSI Works
Visual introduction to principal components analysis (used in LSI)
The authority and hubs model:
Authoritative Sources in a Hyperlinked Environment, Jon Kleinberg, Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998. Extended version in Journal of the ACM 46(1999). Also appears as IBM Research Report RJ 10076, May 1997.

On the stability of PageRank and HITS and the connection to LSI:
Link Analysis, Eigenvectors and Stability, A. Ng, A. Zheng, and M. Jordan. IJCAI-01.
Requires some linear algebra and math bravery, but very good.

The "search engine"-related web site:
Search Engine Watch, Danny Sullivan.

A short paper on snippet generation

Learning, data mining, personalization

The best reference for machine learning (alas it is very expensive, so you might wish to go to the library or ask me to xerox pages for you):
Machine Learning, T. Mitchell, McGraw-Hill, 1997.
The SPRINT paper, which explains how to scale a decision tree learner to handle data that is much larger than memory.
Naïve Bayes and Nearest Neighbor by Estelle Brand and Rob Gerritsen
Recommendation as Classification: Using Social and Content-Based Information in Recommendation, Chumki Basu, Haym Hirsh, and William W. Cohen, AAAI-98.
The original paper describing RIPPER, a fast rule learner which has proven to be a good tool for learning from unstructured text.
Two approaches to organizing web search results:

Hierarchical Classification of Web Content (this is Sue Dumais's work at Microsoft)
A Dynamic Clustering Interface to Web Search Results, Zamir, O. and Etzioni, O., WWW-8, 1999.

Visualizing weblogs by learning Markov models: short and medium-length written versions.

Information Extraction

D. Freitag and A. McCallum, Information extraction with HMM structures learned by stochastic optimization, AAAI-2000.
The UW KnowItAll Project
Question Answering on the Web
The Mulder paper
The NSIR

Web Services & XML protocols

An overview of Web Services

XML itself: a high-level introduction and the detailed specification (W3C Recommendation)
Details on XSL, XLinks, and XPointers
A tutorial on SOAP
Both overview and technical material on UDDI
Introduction to WSDL

Examples of Web Services:

Amazon

Amazon Light
The Amazon Browser
Digital Camera Comparison Shopper
Overview news article

Google

A good introduction to data integration
A high-level vision for the Semantic Web [Req]

Security & E-Commerce

Kevin Fu, Emil Sit, Kendra Smith, and Nick Feamster, Dos and Don'ts of Client Authentication on the Web, MIT Tech Report 818, May 2001. [Req]
The classic (1978) paper by Rivest, Shamir and Adleman on public key cryptography: A Method for Obtaining Digital Signatures and Public-Key Cryptosystems. Still a good read.
A Cryptography FAQ by RSA Data Security. [Req]
The Black market for credit cards and identities

Hazards: Spam, viruses, spyware and the like

The Linguistics of Link Spam
A Measurement and Analysis of Spyware in a University Environment, Stefan Saroiu, Steven D. Gribble, and Henry M. Levy. Proceedings of the First Symposium on Networked Systems Design and Implementation (NSDI), March 2004.
Personal data for the taking

Department of Computer Science & Engineering
University of Washington
Box 352350
Seattle, WA 98195-2350
(206) 543-1695 voice, (206) 543-2969 FAX