The following materials are very closely related and may be useful.
- Crawlers: If your project requires you to build a crawler, please be sure to know your responsibilities. Also, please read the Mercator paper. There are three recommended strategies for incorporating crawling capabilities into your project, listed below in increasing order of sophistication.
- GNU Wget is a free utility for downloading files from the Web. It's pretty basic, but could be the perfect choice if you primarily need to fetch files from one or a small number of sites.
- Heritrix is an open-source, extensible, Web-scale crawler distributed by the Internet Archive. It's not quite as flexible as Nutch but even easier to get running. Particularly convenient is its Web-based UI, which lets you create and configure crawling jobs. Pretty snazzy!
- Nutch is a full-featured, industrial-strength, open-source Web search software package. If all you need is a crawler, you can throw away the Lucene information retrieval part (which does TF/IDF and other types of ranking of documents based on an optimized, inverted-index data store). You get complete control, but through easy programming (a minimal sketch of the basic fetch-and-parse step these tools automate appears right after this list). It's really not that bad, but the others might be easier if you have limited crawling needs.
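The sketch referred to above is a hypothetical illustration of the single fetch-and-parse step that any of these crawlers repeats, written against only the standard Java library. Everything in it (the start URL, the User-Agent string, the link regex) is a placeholder, and a real crawler would add politeness delays, robots.txt handling, and duplicate detection along the lines discussed in the Mercator paper.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FetchSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder start page; a real crawler would read URLs from a frontier queue.
        URL page = new URL("http://example.com/");
        URLConnection conn = page.openConnection();
        conn.setRequestProperty("User-Agent", "course-project-crawler"); // identify yourself

        // Download the page into a string.
        StringBuilder html = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
        }

        // Very crude link extraction; a real crawler would use an HTML parser, not a regex.
        Pattern href = Pattern.compile("href=\"(http[^\"]+)\"");
        Matcher m = href.matcher(html);
        while (m.find()) {
            System.out.println(m.group(1)); // candidate URLs to add to the frontier
        }
    }
}
```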
- Machine Learning & IE Packages:
- Weka is a well-developed and simple-to-use machine learning package which is quite popular. A book provides excellent documentation, but there is plenty of material online as well (a short usage sketch appears after this list).
- GATE, the General Architecture for Text Engineering, is an NLP toolkit which includes support for information extraction and has a Weka interface. It appears robust and well-used, but I have no experience with it.
- Mallet is aimed at statistical natural language processing, but has quite a bit of machine learning code built in. Specifically, it provides learning and decoding functions for conditional random fields (CRFs), which are similar to, but better than, HMMs. Documentation on this functionality ("sequence tagging") is here. You may also want to read a Guide for Using Mallet written by Fei Xia.
- CRF++ is "a simple,
customizable, and open source implementation of Conditional Random Fields
(CRFs) for segmenting/labeling sequential data. CRF++ is designed for
generic purpose and will be applied to a variety of NLP tasks, such as
Named Entity Recognition, Information Extraction and Text Chunking."
Chloe has provided a mini-tutorial to supplement the information on the CRF++ page here.
- Reuters' web service for named entity identification, covering about 20 classes of entities: opencalais.com
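As promised in the Weka item above, here is a minimal sketch of what a Weka experiment looks like in Java: load an ARFF dataset, train a decision tree, and report 10-fold cross-validation results. The file name data.arff is a placeholder, and the class names assume a reasonably recent Weka release, so treat this as a sketch rather than version-exact code.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaSketch {
    public static void main(String[] args) throws Exception {
        // Load a dataset in Weka's ARFF format (placeholder file name).
        Instances data = new DataSource("data.arff").getDataSet();
        // By convention the last attribute is the class label.
        data.setClassIndex(data.numAttributes() - 1);

        // Train a C4.5-style decision tree and estimate accuracy with 10-fold cross-validation.
        J48 tree = new J48();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```

The same few lines work for most classifiers in the package; swapping J48 for another learner (e.g., weka.classifiers.bayes.NaiveBayes) leaves the rest of the code unchanged.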
- Natural Language Processing Libraries and Tools
- OpenNLP contains a library of Java code for all sorts of NLP tasks, such as sentence detection, tokenization, POS tagging, chunking and parsing, named-entity detection, and coreference resolution. Good documentation of the libraries is given on the site (a short tagging sketch appears after this list).
- If you just need a part-of-speech tagger you can check out Stanford's tagger. This is also in Java. The site links to a tutorial using the tagger on XML data. Stanford also provides a parser, a named entity recognizer, and a classifier that are all available separately from the Stanford NLP Group's software page.
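The tagging sketch referred to in the OpenNLP item: tokenize a sentence and POS-tag it. This assumes the OpenNLP 1.5-style API and the pretrained English model files (en-token.bin, en-pos-maxent.bin) downloadable from the project site; adjust class and file names to whatever version you actually install.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class PosTagSketch {
    public static void main(String[] args) throws Exception {
        // Pretrained English models from the OpenNLP site (paths are placeholders).
        try (InputStream tokStream = new FileInputStream("en-token.bin");
             InputStream posStream = new FileInputStream("en-pos-maxent.bin")) {
            TokenizerME tokenizer = new TokenizerME(new TokenizerModel(tokStream));
            POSTaggerME tagger = new POSTaggerME(new POSModel(posStream));

            // Tokenize, then tag each token with its part of speech.
            String[] tokens = tokenizer.tokenize("The crawler fetched ten thousand pages overnight.");
            String[] tags = tagger.tag(tokens);
            for (int i = 0; i < tokens.length; i++) {
                System.out.println(tokens[i] + "/" + tags[i]);
            }
        }
    }
}
```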
- Related Courses and Materials