The following materials are very closely related and may be useful.
- Crawlers: If your project requires you to build a crawler, please be sure to know your responsibilities. Also, please read the Mercator paper. There are three recommended strategies for incorporating crawling capabilities into your project, listed below in increasing order of sophistication.
- GNU Wget is a free utility for downloading files from the Web. It's pretty basic, but could be the perfect choice if you primarily need to fetch files from one or a small number of sites.
- Heritrix is an open-source, extensible, Web-scale crawler distributed by the Internet Archive. It's not quite as flexible as Nutch but even easier to get running. Particularly convenient is its Web-based UI, which lets you create and configure crawling jobs. Pretty snazzy!
- Nutch is a full-featured, industrial-strength, open-source Web search software package. If all you need is a crawler, you can throw away the Lucene information retrieval part (which does TF/IDF and other types of ranking of documents based on an optimized, inverted-index data store). You get complete control, but through easy programming (a minimal sketch of the basic fetch-and-parse step these tools automate appears right after this list). It's really not that bad, but the others might be easier if you have limited crawling needs.
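The sketch referred to above is a hypothetical illustration of the single fetch-and-parse step that any of these crawlers repeats, written against only the standard Java library. Everything in it (the start URL, the User-Agent string, the link regex) is a placeholder, and a real crawler would add politeness delays, robots.txt handling, and duplicate detection along the lines discussed in the Mercator paper.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FetchSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder start page; a real crawler would read URLs from a frontier queue.
        URL page = new URL("http://example.com/");
        URLConnection conn = page.openConnection();
        conn.setRequestProperty("User-Agent", "course-project-crawler"); // identify yourself

        // Download the page into a string.
        StringBuilder html = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
        }

        // Very crude link extraction; a real crawler would use an HTML parser, not a regex.
        Pattern href = Pattern.compile("href=\"(http[^\"]+)\"");
        Matcher m = href.matcher(html);
        while (m.find()) {
            System.out.println(m.group(1)); // candidate URLs to add to the frontier
        }
    }
}
```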
- Machine Learning & IE Packages:
- Weka is a well-developed and simple-to-use machine learning package which is quite popular. A book provides excellent documentation, but there is plenty of material online as well (a short usage sketch appears after this list).
- GATE, the General Architecture for Text Engineering, is an NLP toolkit which includes support for information extraction and has a Weka interface. It appears robust and well-used, but I have no experience with it.
- Mallet is aimed at statistical natural language processing, but has quite a bit of machine learning code built in. Specifically, it provides learning and decoding functions for conditional random fields (CRFs), which are similar to, but better than, HMMs. Documentation on this functionality ("sequence tagging") is here. You may also want to read a Guide for Using Mallet written by Fei Xia.
- CRF++ is "a simple,
customizable, and open source implementation of Conditional Random Fields
(CRFs) for segmenting/labeling sequential data. CRF++ is designed for
generic purpose and will be applied to a variety of NLP tasks, such as
Named Entity Recognition, Information Extraction and Text Chunking."
Chloe has provided a mini-tutorial to supplement the information on the CRF++ page here.
- Reuters' web service for named entity identification, covering about 20 classes of entities: opencalais.com
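As promised in the Weka item above, here is a minimal sketch of what a Weka experiment looks like in Java: load an ARFF dataset, train a decision tree, and report 10-fold cross-validation results. The file name data.arff is a placeholder, and the class names assume a reasonably recent Weka release, so treat this as a sketch rather than version-exact code.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaSketch {
    public static void main(String[] args) throws Exception {
        // Load a dataset in Weka's ARFF format (placeholder file name).
        Instances data = new DataSource("data.arff").getDataSet();
        // By convention the last attribute is the class label.
        data.setClassIndex(data.numAttributes() - 1);

        // Train a C4.5-style decision tree and estimate accuracy with 10-fold cross-validation.
        J48 tree = new J48();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```

The same few lines work for most classifiers in the package; swapping J48 for another learner (e.g., weka.classifiers.bayes.NaiveBayes) leaves the rest of the code unchanged.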
- Natural Language Processing Libraries and Tools
- OpenNLP contains a library of Java code for all sorts of NLP tasks, such as sentence detection, tokenization, POS tagging, chunking and parsing, named-entity detection, and coreference resolution. Good documentation of the libraries is given on the site (a short tagging sketch appears after this list).
- If you just need a part-of-speech tagger you can check out Stanford's tagger. This is also in Java. The site links to a tutorial using the tagger on XML data. Stanford also provides a parser, a named entity recognizer, and a classifier that are all available separately from the Stanford NLP Group's software page.
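The tagging sketch referred to in the OpenNLP item: tokenize a sentence and POS-tag it. This assumes the OpenNLP 1.5-style API and the pretrained English model files (en-token.bin, en-pos-maxent.bin) downloadable from the project site; adjust class and file names to whatever version you actually install.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class PosTagSketch {
    public static void main(String[] args) throws Exception {
        // Pretrained English models from the OpenNLP site (paths are placeholders).
        try (InputStream tokStream = new FileInputStream("en-token.bin");
             InputStream posStream = new FileInputStream("en-pos-maxent.bin")) {
            TokenizerME tokenizer = new TokenizerME(new TokenizerModel(tokStream));
            POSTaggerME tagger = new POSTaggerME(new POSModel(posStream));

            // Tokenize, then tag each token with its part of speech.
            String[] tokens = tokenizer.tokenize("The crawler fetched ten thousand pages overnight.");
            String[] tags = tagger.tag(tokens);
            for (int i = 0; i < tokens.length; i++) {
                System.out.println(tokens[i] + "/" + tags[i]);
            }
        }
    }
}
```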
- Related Courses and Materials