CSE454 Event Extraction Project

University of Washington Department of Computer Science & Engineering

Feel free to adapt this project in any of a myriad of ways. I'll describe it one way, aiming for concreteness, but don't consider it written in stone.
The primary deliverable would be a Web service that takes natural language (English) text as input, e.g. a news article, and outputs the set of events described in the story. For example, given a sentence like
The assault on the Paris offices of Charlie Hebdo, a French newspaper that has repeatedly satirized religion, was one of the deadliest in a history of violent responses and threats against the news media over the mockery of Islam.
the system might output:
*unknown* ATTACK "Paris offices of Charlie Hebdo"
A baseline event extraction system can be built on top of the ReVerb open information extraction system (http://reverb.cs.washington.edu/), which is an easy-to-use, fast, and (reasonably) robust system developed here at UW. ReVerb extracts triples from text, but it doesn't normalize the "relation phrases" that denote events, as in our example. It also extracts some triples that don't correspond to events at all; e.g., from a nutrition page it might extract ORANGES CONTAIN VITAMIN-C.
So the project's first task would be to build a classifier that determines whether a relation phrase corresponds to an event and, if so, which event. We'll provide a set of 40 events from the DARPA "Event Nugget" competition. I'd suggest that you start with a subset of these and first write a classifier that uses hand-coded rules to do this classification. In parallel (or afterwards), you could use machine learning (which we'll explain in this course) to train a classifier from labeled training data. The ML code can be an easy download, e.g. from Weka, or, if you want, you can build your own.
We will provide some training data; you can also (if you want) use crowdsourcing to create training data. We'll show how to do this in class in the next week.
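As a rough illustration of the hand-coded-rules starting point, here is a minimal Python sketch. The `Triple` class, the `RULES` table, and the regular expressions are all assumptions made up for this example (they are not part of ReVerb or of the provided data), and they cover only a handful of event types.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Triple:
    """One (argument 1, relation phrase, argument 2) extraction."""
    arg1: str
    relation: str
    arg2: str


# Hypothetical hand-written rules: (event type, pattern over the relation phrase).
# The patterns are illustrative only; a real rule set would need far more care.
RULES = [
    ("ATTACK", re.compile(r"\b(attack|assault|bomb|shoot|shot)\w*", re.I)),
    ("DIE", re.compile(r"\b(die|dies|died|passed away|was killed)\b", re.I)),
    ("MARRY", re.compile(r"\b(marr(y|ies|ied)|wed)\b", re.I)),
    ("ARREST-JAIL", re.compile(r"\b(arrest|jail|imprison)\w*", re.I)),
]


def classify(triple: Triple) -> Optional[str]:
    """Return the first event type whose rule matches, or None for a non-event."""
    for event_type, pattern in RULES:
        if pattern.search(triple.relation):
            return event_type
    return None
```

With these rules, `classify(Triple("Oranges", "contain", "vitamin C"))` returns `None`, so the nutrition triple is filtered out, while a relation phrase such as "attacked" is mapped to ATTACK. Rules are tried in order, so the ordering of `RULES` decides ties.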
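If you do collect crowdsourced labels, one simple way to turn several workers' votes into training labels is majority voting with an agreement threshold. The sketch below is only one possible approach; the function names and the 0.6 threshold are assumptions chosen for illustration.

```python
from collections import Counter


def aggregate_labels(votes):
    """Return (majority label, fraction of votes agreeing with it) for one item."""
    if not votes:
        raise ValueError("need at least one vote")
    counts = Counter(votes)
    # Ties are broken in favor of the label that was seen first.
    label, top = counts.most_common(1)[0]
    return label, top / len(votes)


def build_training_set(votes_by_item, min_agreement=0.6):
    """Keep only the items whose majority label reaches the agreement threshold."""
    training = {}
    for item, votes in votes_by_item.items():
        label, agreement = aggregate_labels(votes)
        if agreement >= min_agreement:
            training[item] = label
    return training
```

The agreement fraction doubles as a crude per-item quality measure: items where workers disagree can be dropped or sent out for more votes.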

Taxonomy of Events

LIFE

BE-BORN
MARRY
DIVORCE
INJURE
DIE

MOVEMENT

TRANSPORT

TRANSACTION

TRANSFER-OWNERSHIP
TRANSFER-MONEY

BUSINESS

START-ORG
MERGE-ORG
DECLARE-BANKRUPTCY
END-ORG

CONFLICT

ATTACK
DEMONSTRATE

CONTACT

MEET
PHONE-WRITE

PERSONNEL

START-POSITION
END-POSITION
NOMINATE
ELECT

JUSTICE

ARREST-JAIL
RELEASE-PAROLE
TRIAL-HEARING
CHARGE-INDICT
SUE
CONVICT
SENTENCE
FINE
EXECUTE
EXTRADITE
ACQUIT
APPEAL
PARDON
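
For use in code, the taxonomy listed above could be written down as a simple mapping from category to event types. This is just one convenient representation; the names `TAXONOMY`, `CATEGORY_OF`, and `category_of` are assumptions for this sketch.

```python
# Category -> event types, transcribed from the list above.
TAXONOMY = {
    "LIFE": ["BE-BORN", "MARRY", "DIVORCE", "INJURE", "DIE"],
    "MOVEMENT": ["TRANSPORT"],
    "TRANSACTION": ["TRANSFER-OWNERSHIP", "TRANSFER-MONEY"],
    "BUSINESS": ["START-ORG", "MERGE-ORG", "DECLARE-BANKRUPTCY", "END-ORG"],
    "CONFLICT": ["ATTACK", "DEMONSTRATE"],
    "CONTACT": ["MEET", "PHONE-WRITE"],
    "PERSONNEL": ["START-POSITION", "END-POSITION", "NOMINATE", "ELECT"],
    "JUSTICE": [
        "ARREST-JAIL", "RELEASE-PAROLE", "TRIAL-HEARING", "CHARGE-INDICT",
        "SUE", "CONVICT", "SENTENCE", "FINE", "EXECUTE", "EXTRADITE",
        "ACQUIT", "APPEAL", "PARDON",
    ],
}

# Reverse index: event type -> category.
CATEGORY_OF = {
    event: category
    for category, events in TAXONOMY.items()
    for event in events
}


def category_of(event_type):
    """Return the category of an event type, or None if it is not listed."""
    return CATEGORY_OF.get(event_type)
```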

Data

Human-readable file, including annotations and ReVerb extractions, here. How to read this file here.
Parsed sentence file, including the results of the Stanford parser and ReVerb, here. How to read this file here.

Enhancements

Time should permit you to extend the baseline system in one or more ways, as your interest directs. Some ideas include:

It appears that many events in the training set are actually described as noun phrases, so if you use only ReVerb, your recall will be low. One way to find more events (though not their arguments) is to look for nominals that correspond to events, i.e. instead of the verb "exploded" look for the noun "the explosion"; instead of "acquired" look for "the acquisition". An off-the-shelf part-of-speech tagger can identify noun phrases, and nominals would be useful features in a classifier. We can give you a list of verbs that we've found to correspond to certain event classes, and you can try to automate their conversion into nominals. This could be done in two ways: 1) using WordNet somehow, or 2) crowdsourcing.
Build a Web front end (or app) for the system that allows someone to paste text from a news story of interest (or paste a URL), then runs the extractor and displays the results.
Extend the system to handle more of the 40 event types (or create your own event types).
Open information extraction provides the subject and object as part of its triples, but these are treated as plain text. For your first baseline, you can ignore these and just deal with the event type. As an extension, you can try to improve the subject and object arguments, or use features of the subject and object to help with the classification. Here's one idea: run the FIGER fine-grained entity recognition system on the subjects and objects to determine their types (from a predefined set of 110 types). For example, if an event phrase is "demolished", that could mean an attack. But if the subject is a sports team (as recognized by FIGER), it indicates a sporting event instead.
Really focus on the crowdsourcing aspect. Measure the quality of data produced.
Your ideas here.
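
To make the nominal idea from the list above concrete, here is a small Python sketch. It assumes you already have a verb-to-nominal lexicon (for example, one derived from WordNet or from crowdsourcing); the two dictionaries below are hypothetical stand-ins for that lexicon and for the verb list, and the determiner check is only a crude substitute for a real part-of-speech tagger.

```python
import re

# Hypothetical lexicon: verb -> nominal form (hand-written for illustration).
VERB_TO_NOMINAL = {
    "explode": "explosion",
    "acquire": "acquisition",
    "assault": "assault",
    "arrest": "arrest",
}

# Hypothetical verb list: verb -> event type.
VERB_TO_EVENT = {
    "explode": "ATTACK",
    "assault": "ATTACK",
    "acquire": "TRANSFER-OWNERSHIP",
    "arrest": "ARREST-JAIL",
}

# Derived index: nominal -> event type.
NOMINAL_TO_EVENT = {VERB_TO_NOMINAL[v]: e for v, e in VERB_TO_EVENT.items()}

DETERMINERS = {"the", "a", "an"}


def find_nominal_events(sentence):
    """Return (nominal, event type) pairs for known nominals that follow a determiner."""
    tokens = re.findall(r"[A-Za-z]+", sentence.lower())
    found = []
    for prev, tok in zip(tokens, tokens[1:]):
        if prev in DETERMINERS and tok in NOMINAL_TO_EVENT:
            found.append((tok, NOMINAL_TO_EVENT[tok]))
    return found
```

On the Charlie Hebdo sentence from the introduction, this picks up "the assault" as an ATTACK trigger even though no verb in the sentence expresses the attack.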
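
The entity-type idea from the list above (the "demolished" example) might look roughly like this. The type names below are placeholders and not FIGER's actual tag set, and the lookup table is a hypothetical stand-in for whatever classifier you have built; the sketch only shows how subject types can veto an otherwise plausible event label.

```python
from typing import Iterable, Optional

# Placeholder entity types that suggest a sporting context (not real FIGER tags).
SPORTS_TYPES = {"sports_team", "athlete"}

# Hypothetical ambiguous relation phrases and their default event reading.
AMBIGUOUS_RELATIONS = {
    "demolished": "ATTACK",
    "crushed": "ATTACK",
}


def classify_with_types(relation: str, subject_types: Iterable[str]) -> Optional[str]:
    """Return an event type, or None if there is no event or the subject types veto it."""
    event = AMBIGUOUS_RELATIONS.get(relation.lower())
    if event == "ATTACK" and SPORTS_TYPES & set(subject_types):
        # A sports-team subject suggests a sporting event, which is not in the taxonomy.
        return None
    return event
```

In a learned classifier, the same information would more naturally enter as features (one indicator per subject or object type) than as a hard veto.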

Department of Computer Science & Engineering
University of Washington
Box 352350
Seattle, WA 98195-2350
(206) 543-1695 voice, (206) 543-2969 FAX