Here is the recommended reading, organized by content. See each individual
page for annotations describing the importance of the book or article
(especially useful items are marked [Req]).
Cluster-Based Scalable
Network Services by Armando Fox, Steven D. Gribble, Yatin
Chawathe, Eric A. Brewer, and Paul Gauthier. Symposium on Operating
Systems Principles (SOSP) 1997.
A careful paper describing a variety of architectures for
building parallel
crawlers. The authors propose metrics to evaluate a parallel crawler, and
compare the proposed architectures using 40 million pages collected
from the Web. The results clarify the relative merits of each
architecture and provide a good guideline on when to adopt which
architecture.
A description of the ancestor of the crawler which we used in
the 2002 project: Robert
Miller's WebSphinx,
implemented at CMU and originally reported in a paper in
WWW7.
What order should a crawler use when following links? Efficient Crawling Through URL Ordering,
Junghoo Cho, Hector Garcia-Molina and Lawrence Page, Stanford University, 1998.
Basic IR textbook Modern Information Retrieval,
R. Baeza-Yates and B. Ribeiro-Neto, Addison Wesley, 1999.
Covers the vector space model (section 2), precision/recall (3), inverted
files (8), and inverted file compression (7.4.5). Not as up to date, but
half the price.
The authority and hubs model: Authoritative Sources in a Hyperlinked Environment,
Jon Kleinberg, Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998. Extended version in Journal of the ACM 46 (1999). Also appears as IBM Research Report RJ 10076, May 1997.
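As a rough illustration of the hubs-and-authorities idea, here is a minimal sketch of the iterative update: a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to, with both vectors renormalized each round. The function names, the dictionary-based graph format, and the fixed iteration count are assumptions made for this example, not details taken from the paper.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length (no-op if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {k: v / norm for k, v in scores.items()}


def hits(graph, iterations=50):
    """Return (hub, authority) score dicts.

    graph maps each page to the list of pages it links to
    (hypothetical input format chosen for this sketch).
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)

    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # Authority update: sum hub scores over incoming links.
        new_auth = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for dst in targets:
                new_auth[dst] += hub[src]

        # Hub update: sum the new authority scores over outgoing links.
        new_hub = {
            n: sum(new_auth[d] for d in graph.get(n, []))
            for n in nodes
        }

        auth = normalize(new_auth)
        hub = normalize(new_hub)

    return hub, auth


# Toy graph: pages "a" and "b" both link to "c".
hub, auth = hits({"a": ["c"], "b": ["c"], "c": []})
```

In the toy graph, "c" ends up as the only authority, while "a" and "b" share equal hub scores.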
On the stability of PageRank and HITS and the connection to LSI, Link Analysis, Eigenvectors and Stability,
A. Ng, A. Zheng, and M. Jordan. IJCAI-01.
Requires some linear algebra and math bravery, but very good.
Valentin I. Spitkovsky and Angel X. Chang. 2012.
A Cross-Lingual Dictionary for English Wikipedia Concepts.
In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012).
The NYU
slot-filling system, which won the 2012 competition (though probably
not as helpful a paper for 454 readers).
Machine Learning Approaches
The Mintz/Jurafsky
paper on distant
supervision, which describes a simple approach to the problem.
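The core labeling heuristic behind distant supervision can be sketched in a few lines: any sentence that mentions a pair of entities related in a knowledge base is treated as a (noisy) training example for that relation. The function name, the tuple format, the "NONE" label, and the toy data below are all hypothetical and chosen only for illustration; a real system would go on to extract features from the labeled sentences and train a classifier.

```python
def distant_label(sentences, kb):
    """Attach noisy relation labels to sentences.

    sentences: list of (text, entity1, entity2) tuples, assuming entity
               mentions have already been identified.
    kb:        dict mapping (entity1, entity2) to a relation name.
    Pairs missing from the knowledge base get the label "NONE".
    """
    labeled = []
    for text, e1, e2 in sentences:
        relation = kb.get((e1, e2), "NONE")
        labeled.append((text, e1, e2, relation))
    return labeled


# Toy knowledge base and corpus (illustrative only).
kb = {("Barack Obama", "Hawaii"): "born_in"}
sentences = [
    ("Barack Obama was born in Hawaii.", "Barack Obama", "Hawaii"),
    ("Barack Obama returned to Hawaii for a visit.", "Barack Obama", "Hawaii"),
    ("Barack Obama visited Paris.", "Barack Obama", "Paris"),
]
examples = distant_label(sentences, kb)
```

Note that the second sentence receives the label born_in even though it does not express that relation; this kind of label noise is the main weakness of the heuristic.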
The MultiR system, which is much better (but a bit complicated), built here at UW. Code is available.
Another small improvement is described
by Mihai
Surdeanu; by using logistic regression with regularization,
the system may overfit less badly. Code is available
for download. See
also Mihai's notes
on negative examples.
The best reference for machine learning (alas it is very expensive, so
you might wish to go to the library or ask me to xerox pages for you): Machine Learning, T. Mitchell, McGraw-Hill, 1997.
The SPRINT
paper, which explains how to scale a decision tree learner to handle data
that is much larger than memory.