Using CRF++

CRF++ is a simple, flexible, open-source implementation of Conditional Random Fields (CRFs) that you might want to use in your projects. If all you need is a CRF for extraction, it may be a better choice than one of the larger multi-purpose machine learning libraries such as Mallet (which may be the better option if you need more advanced tools).

Downloading CRF++

You can get the software from http://crfpp.sourceforge.net/ either as a binary for Windows machines or as source code. You can compile the source code on any machine that has a C++ compiler. It took me no time at all to get it up and running on my MacBook. The website has usage information and explains how to set up your training, test and model files. It is pretty straightforward. Most of the information on this page is adapted from that site. Hopefully, between this page and the CRF++ website, you will be all right!

Compiling CRF++

On your own machine

To configure and compile the C++ source code, make sure you have a C++ compiler on your machine and type:

./configure

After that runs, build the source code by running:

make

To install the program, you need root privileges on the machine. On my Mac I get them by running:

sudo make install

and entering my password. The CRF++ website says to run:

su
make install
I'm not sure which will be best for your machine.

On cubist (or another UW machine)

Since you don't have root access on this machine, you won't be able to run su/sudo. To fix this, change the configure command to:

./configure --prefix=/your_project_directory/

Then you can run make install without needing su/sudo.
Configuring your training data file and template file

Your training data file should consist of the sentences (or sequences of tokens) that you want to label. Each token/word in a sentence gets its own line, and there is a blank line between sentences:

Test
Each token line also holds all the features of that token. These could be the token's part of speech, the type of entity it is, the first two characters of the word, etc. Features are separated by spaces, and the true label for the token is written at the end of the line:

Test adjective T WORD
Note that every line must have the same number of features, written in the same order. For example, in the training example above every token line has the same format:

Word Part_of_speech First_character LABEL

Notice also that the features on each line are specific to that token only (I will explain how to add window features and more complex features in a moment). If you want global features, you need to add them to every line. So if a sentence can come from a type A page or a type B page, you would label each token with that feature:

Test adjective T TYPEB WORD
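A minimal sketch of how such lines could be generated, assuming Python and made-up sentences, tags and labels. The helper names `format_sentence` and `write_training_file` and the `page_type` argument are hypothetical and not part of CRF++. The sketch keeps the columns in a fixed order, puts the label last, repeats the global page-type feature on every token line, and ends each sentence with a blank line.

```python
from typing import Iterable, List, Tuple


def format_sentence(rows: Iterable[Tuple[str, str, str]], page_type: str) -> List[str]:
    """Return the token lines for one sentence: word, POS, first char, page type, label."""
    lines = []
    for word, pos, label in rows:
        # The first-character feature is derived from the word itself;
        # the global page-type feature is repeated on every token line.
        columns = [word, pos, word[:1], page_type, label]
        for column in columns:
            # Whitespace inside a column would change the column count.
            if not column or any(ch.isspace() for ch in column):
                raise ValueError(f"bad column {column!r} in {columns!r}")
        lines.append(" ".join(columns))
    return lines


def write_training_file(path: str, sentences, page_types) -> None:
    """Write every sentence to path, followed by a blank separator line."""
    with open(path, "w", encoding="utf-8") as handle:
        for rows, page_type in zip(sentences, page_types):
            for line in format_sentence(rows, page_type):
                handle.write(line + "\n")
            handle.write("\n")  # blank line ends the sentence


# Made-up data purely for illustration.
example = [("Test", "adjective", "WORD"), ("two", "noun", "NUMBER")]
for line in format_sentence(example, "TYPEB"):
    print(line)
```

Because every line is built from the same list of columns, the "same number of features in the same order" requirement holds by construction.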
Now that you know what your training data file looks like, you can learn how to specify exactly which features you want in your CRF model.

The template file

(Note: this is explained MUCH better on the CRF++ website. I am just giving an overview. I suggest you go to the CRF++ site to read up on the template file syntax. It isn't too terrible.)

The template file sets up which features to use during a run of the CRF. Think of the training file as a table. If you are currently at the line for the word "one" in the previous example, %x[0,0] represents the current word. Likewise, %x[0,1] is the current word's part of speech, and %x[0,2] is the current word's first character. Then %x[-1,0] would be the previous word, %x[-1,1] the previous word's part of speech, %x[1,0] the next word, and so on. Basically, if you have a template file that says:
U0:%x[0,0]
U1:%x[0,1]
U2:%x[0,2]
U3:%x[-1,0]
U4:%x[1,0]
then at any token the CRF examines, it will take into account that word, the word's POS, the word's first character, and the previous and next words when deciding which label to give the token. The "U" names at the start of each line identify the template; they are not output labels. They need to have "U" as the first character for unigram features. The numbers are arbitrary. You can create unigram features that are combinations of features, such as:

U5:%x[0,0]/%x[0,1]

which creates word/POS features such as "two/noun". To automatically create bigram features, add a line that says just B to the template file. This creates bigram features that combine the output label of the previous token with the output label of the current token.
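To make the macro notation concrete, here is a rough Python sketch of how a %x[row,col] macro could be resolved against the training table. It only illustrates the indexing and is not how CRF++ itself is implemented. The `expand` helper, the sample rows and the `_PAD` string used for rows outside the sentence are all assumptions made for this example.

```python
import re
from typing import List

# Matches %x[row,col]; row may be negative, and a space after the comma is tolerated.
MACRO = re.compile(r"%x\[(-?\d+),\s*(\d+)\]")


def expand(template: str, table: List[List[str]], pos: int) -> str:
    """Expand every %x[row,col] in one template line for the token at index pos."""

    def lookup(match):
        row = pos + int(match.group(1))  # row is relative to the current token
        col = int(match.group(2))        # col is an absolute column index
        if 0 <= row < len(table):
            return table[row][col]
        return "_PAD"  # placeholder for positions outside the sentence (assumption)

    return MACRO.sub(lookup, template)


# Made-up sentence: word, part of speech, first character (label column omitted).
table = [
    ["Test", "adjective", "T"],
    ["one", "noun", "o"],
    ["now", "adverb", "n"],
]

# Expand a few template lines while standing on the token "one" (index 1).
for line in ["U0:%x[0,0]", "U3:%x[-1,0]", "U4:%x[1,0]", "U5:%x[0,0]/%x[0,1]"]:
    print(expand(line, table, pos=1))
```

Standing on "one", the four lines expand to U0:one, U3:Test, U4:now and U5:one/noun, which is the kind of feature string the combined word/POS template describes.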
Training and Testing

To train a CRF on a training data set using a specific template file, just run:

crf_learn template_file train_file model_file
This writes a CRF model to model_file. Then, to run the model on test data, you can run:

crf_test -m model_file test_files ...
crf_test prints out each token in the test file with its features (and true label, if given), followed by the label the CRF model assigned to the token. A test file looks just like a training file, except that you can omit the labels at the ends of the lines. If you keep the labels, you can compute the precision and recall of your model. However, since this is testing data and could be real data from websites, you may not have these labels, which is fine. You can also set specific error rates and iteration counts for learning, and choose how the testing output is formatted. The options for crf_learn and crf_test are documented on the CRF++ website.

Examples

The CRF++ download has a few examples for you to play with. I also wrote a toy example for information extraction, assuming the line features:

word part_of_speech entity_type page_type(assuming A or B) LABEL

You can look at my files here: Information Extraction
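When the test file keeps its true labels, the precision and recall mentioned under Training and Testing can be computed from the crf_test output. A minimal Python sketch, assuming the plain output format in which each non-blank line ends with the true label followed by the predicted label. The `per_label_scores` helper and the sample lines are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def per_label_scores(lines: Iterable[str]) -> Dict[str, Tuple[float, float]]:
    """Return {label: (precision, recall)}, counted token by token."""
    true_pos, predicted, actual = Counter(), Counter(), Counter()
    for line in lines:
        columns = line.split()
        if len(columns) < 2:
            continue  # blank line between sentences
        gold, guess = columns[-2], columns[-1]  # last two columns (assumed layout)
        actual[gold] += 1
        predicted[guess] += 1
        if gold == guess:
            true_pos[gold] += 1
    scores = {}
    for label in set(actual) | set(predicted):
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / actual[label] if actual[label] else 0.0
        scores[label] = (precision, recall)
    return scores


# Made-up output lines: features, true label, predicted label.
sample = [
    "Test adjective T WORD WORD",
    "one noun o NUMBER WORD",
    "",
    "two noun t NUMBER NUMBER",
]
for label, (precision, recall) in sorted(per_label_scores(sample).items()):
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

On real data, one option is to redirect the crf_test output to a file and pass the open file object to `per_label_scores`, since it accepts any iterable of lines.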
Department of Computer Science & Engineering, University of Washington, Box 352350, Seattle, WA 98195-2350. (206) 543-1695 voice, (206) 543-2969 FAX