CSE454 Resources

The following materials are very closely related and may be useful.

Crawlers: If your project requires you to build a crawler, please be sure to know your responsibilities. Also, please read the Mercator paper. There are four recommended strategies for incorporating crawling capabilities into your project, listed below in increasing order of sophistication.
- Scrapy is a simple but powerful Python web scraping framework.
- GNU Wget is a free utility for downloading files from the Web. It's pretty basic, but could be the perfect choice if you need primarily to fetch files from one or a small number of sites.
- Heritrix is an open-source, extensible, Web-scale crawler distributed by the Internet Archive. It's not quite as flexible as Nutch but even easier to get running. Particularly convenient is a Web-based UI, which lets you create and configure crawling jobs. Pretty snazzy!
- Nutch is a full featured, industrial strength, open source Web search software package. If all you need is a crawler, you can throw away the Lucene information retrieval part (which does TF/IDF and other types of ranking of documents based on an optimized, inverted-index data store). You get complete control, but through easy programming. It's really not that bad, but the others might be easier if you have limited crawling needs.
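If you end up writing any fetching code yourself, one basic crawler responsibility is honoring each site's robots.txt. The sketch below shows the kind of check a hand-rolled crawler might make before requesting a URL. It assumes Python 3 and uses only the standard library (`urllib.robotparser`; under Python 2.7 the module is named `robotparser`). The user-agent name `cse454-bot`, the example rules, and the helpers `build_parser` and `allowed_urls` are all hypothetical, and the rules are parsed from an inline string rather than downloaded so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would download this from
# http://<host>/robots.txt before requesting any other page on that host.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

USER_AGENT = "cse454-bot"  # hypothetical crawler name


def build_parser(robots_txt):
    """Parse the text of a robots.txt file into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, urls, user_agent=USER_AGENT):
    """Return only the URLs that robots.txt permits this agent to fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


parser = build_parser(EXAMPLE_ROBOTS_TXT)
candidates = [
    "http://example.com/index.html",
    "http://example.com/private/grades.html",
]
fetchable = allowed_urls(parser, candidates)
print(fetchable)
```

Scrapy, Heritrix, and Nutch can handle this check for you; the sketch matters mainly if you script your own fetching.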
Machine Learning & IE Packages:
- Python (we recommend using version 2.7) has a robust suite of machine learning tools. NumPy offers linear algebra tools for easily working with matrices, SciPy bundles useful scientific functions, and scikit-learn provides many commonly used machine learning algorithms. Matplotlib will allow you to visualize your results. All of these libraries can be easily installed using the pip installer built into Python.
- Pattern is a Python web mining module with Google/Twitter/Wikipedia APIs and integrated NLP and ML tools.
- Weka is a well-developed and simple-to-use machine learning package which is quite popular. A book provides excellent documentation, but there is documentation online as well.
- Mallet is aimed at statistical natural language processing, but has quite a bit of machine learning code built in. Specifically, it provides learning and decoding functions for conditional random fields (CRFs) which are similar to, but better than HMMs. Documentation on this functionality ("sequence tagging") is here. You may also want to read a Guide for Using Mallet written by Fei Xia.
- CRF++ is "a simple, customizable, and open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data. CRF++ is designed for generic purpose and will be applied to a variety of NLP tasks, such as Named Entity Recognition, Information Extraction and Text Chunking." A mini-tutorial to supplement the information on the CRF++ page is provided here.
- OpenCalais (opencalais.com) is Reuters' web service for named entity identification, covering about 20 classes of entities.
Natural Language Processing Libraries and Tools
- NLTK is an excellent natural language toolkit for Python. It can be installed using pip, for example by running sudo pip install -U nltk from your command line on Linux/Mac. After installing the module and importing in Python, you should run the command nltk.download() to ensure all components are available on your system.
- Stanford maintains a great list of downloadable tools for statistical NLP tasks, written in different languages.
- LingPipe is a suite of Java libraries for the linguistic analysis of human language, including information extraction and data mining tools. For example, it can track mentions of entities (e.g. people or proteins); link entity mentions to database entries; uncover relations between entities and actions; classify text passages by language, character encoding, genre, topic, or sentiment; correct spelling with respect to a text collection; cluster documents by implicit topic and discover significant trends over time; and provide part-of-speech tagging and phrase chunking. A friend has used it and likes it, but I haven't played with it personally.
- OpenNLP contains a library of Java code for all sorts of NLP tasks such as sentence detection, tokenization, POS tagging, chunking and parsing, named-entity detection, and coreference resolution. Good documentation of the libraries is given on the site.
- If you just need a part-of-speech tagger you can check out Stanford's tagger. This is also in Java. The site links to a tutorial using the tagger on XML data. Stanford also provides a parser, a named entity recognizer, and a classifier that are all available separately from the Stanford NLP Group's software page.
- GATE (General Architecture for Text Engineering) is an NLP toolkit which includes support for information extraction and uses a Weka interface. It appears robust and well-used, but I have no direct experience with it.
Crowdsourcing
- Crowdsourcing guidelines gives some general advice on best practices for crowdsourcing.
- Note: We will continue to expand these resources as the course progresses.
Evaluating Your Projects: There are three main ways you can evaluate your system: two pertain to the system as a whole (including the UI) and the other looks at the performance of one or more submodules. In all cases try to present your material graphically (instead of a big table). When creating such a graph, beware of Microsoft Office default templates which include gratuitous chart junk. Instead maximize the Data-Ink Ratio.
- Informal User Study of your System.
  This is the most important type of user study and the one that is most appropriate for people in this class. The basic idea is to watch a small number of people using your system in order to understand what they are trying to do, how well it works for them, what confuses them and what could be improved. It is usually followed by improvements to the UI and perhaps another evaluation in a process of iterative design improvement. In your write-up, report the users' comments and your subsequent design changes. An excellent thing to read before doing such a study is: Some techniques for Observing Users by Kathleen Gomoll. An example of a good paper which uses this technique is Summarizing Personal Web Browsing Sessions by Mira Dontcheva et al., UIST 2006.
- Formal User Study of your System.
  Once you have a polished UI design, it is common to do a more detailed study, with a larger group of subjects, looking for statistically significant results. It is unlikely that any 454 groups will have time to do this, but here is an example of one paper which (in my biased view) does such a study nicely: Improving the Performance of Motor-Impaired Users with Automatically-Generated, Ability-Based Interfaces, by Gajos, K. and Wobbrock, J. and Weld, D., CHI 2008.
- Module performance study.
  Most (if not all) groups should include at least one experiment of this form. Fortunately, with advance planning, these don't take very long. Indeed, you did something of this form with HW1 and your evaluation of the naive Bayes classifier. The trick is to plan what you will measure before you write your code. Pick a performance measure that is relevant to the system you have built: precision? recall? speed? accuracy? throughput? latency? In the simplest case, just measure this aspect of your system. Ideally, however, you will measure two versions of your system and compare the two. For example, classifier accuracy using a bag of words representation vs. bag of words augmented with part-of-speech tags. Or throughput with and without your snazzy caching scheme. This is why it is important to plan such an experiment before you have implemented the caching mechanism, so that you can easily turn it on and off. Here's one example of a paper which includes results of this form: Information Extraction from Wikipedia: Moving Down the Long Tail by Fei Wu, Raphael Hoffmann, and Daniel S. Weld, KDD 2008.
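As a rough illustration of a module performance study, the following sketch scores two versions of a binary classifier against the same gold labels and reports precision, recall, and accuracy for each. It assumes Python 3 (for true division). The gold labels and both prediction lists are made-up toy data, and the function name `evaluate` is hypothetical; substitute the output of your own module.

```python
def evaluate(gold, predicted):
    """Return precision, recall, and accuracy for binary labels (1 = positive)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)  # true positives
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)  # false positives
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)  # false negatives
    tn = sum(1 for g, p in pairs if g == 0 and p == 0)  # true negatives
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
    }


# Toy data (made up): the same test set labeled by two system variants.
gold = [1, 1, 1, 0, 0, 0, 1, 0]
predictions = {
    "baseline": [1, 0, 1, 1, 0, 0, 0, 0],   # e.g. bag of words only
    "augmented": [1, 1, 1, 1, 0, 0, 0, 0],  # e.g. bag of words + POS tags
}

results = {name: evaluate(gold, pred) for name, pred in predictions.items()}
for name, scores in results.items():
    print(name, scores)
```

Keeping the variants behind a single switch (here, a dictionary key) is what makes the comparison cheap to rerun; the resulting numbers are better presented as a small chart than as a table.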
Related Courses and Materials
- Manning, Raghavan & Schütze's online textbook on information retrieval.
- Craig Knoblock's course on Information Integration on the Web
- Rao Kambhampati's (ASU) Information Retrieval, Mining and Integration on the Internet (Fall'08).
- Philip Greenspun's (MIT) Software Engineering for Internet Applications.
- Craig Knoblock's (USC) Information Integration on the Web at USC/ISI (Sp'08).
- Charles Elkan's (UCSD) Web-scale information retrieval and data mining (Sp'08).
- Ernest Davis' (NYU) Web search engines (Fall'07).
- Previous instances of UW's CSE 454: this course