Feel free to adapt this project in any of a myriad of ways. I'll
describe it one way, aiming for concreteness, but don't consider it
written in stone.
The primary deliverable would be a Web service that takes natural
language (English) text as input, e.g. a news article, and outputs the set of events
described in the story. For example, given a sentence
like
The assault on the Paris offices of Charlie Hebdo, a French newspaper
that has repeatedly satirized religion, was one of the deadliest in a
history of violent responses and threats against the news media over
the mockery of Islam.
the system might output:
*unknown* ATTACK "Paris offices of Charlie Hebdo"
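To make the target output concrete, here is a minimal sketch of how one extracted event could be represented and printed in the style of the example above. The class name `Event`, its field names, and `format_event` are illustrative assumptions, not part of any provided code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """One extracted event. Field names are hypothetical, chosen for illustration."""
    event_type: str                # e.g. "ATTACK"
    agent: Optional[str] = None    # None when the text does not name one
    target: Optional[str] = None   # None when the text does not name one


def _show(argument: Optional[str]) -> str:
    # Known arguments are quoted; missing ones use the *unknown* placeholder.
    return f'"{argument}"' if argument is not None else "*unknown*"


def format_event(event: Event) -> str:
    """Render an event as: <agent> <EVENT-TYPE> <target>."""
    return f"{_show(event.agent)} {event.event_type} {_show(event.target)}"


if __name__ == "__main__":
    print(format_event(Event("ATTACK", target="Paris offices of Charlie Hebdo")))
```

A Web service would typically return a list of such records (for example as JSON) rather than formatted strings; the string form is only for reading results by eye.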
A baseline event extraction system can be built on top of
the ReVerb open information extraction system (http://reverb.cs.washington.edu/),
which is an easy-to-use, fast, and
(reasonably) robust system developed here at UW. ReVerb extracts
triples from text, but it doesn't normalize the "relation phrases" which
denote events in our example. It also extracts some triples that
don't correspond to events at all, e.g. from a nutrition page it might
extract ORANGES CONTAIN VITAMIN-C.
So the project's first task would
be to build a classifier that determines whether a relation phrase
corresponds to an event and, if so, which event. We'll provide a set
of 40 events from the DARPA "Event Nugget" competition. I'd suggest
that you start with a subset of these and
that you first write a classifier that uses hand-coded, human-written
rules to do this classification. In parallel (or afterwards),
you could use machine learning (which we'll explain in this course) to
train a classifier from labeled training data. The ML code can be an
easy download,
e.g. from Weka, or,
if you want, you can build your own.
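The hand-coded rule approach suggested above might start out something like the following sketch. The patterns in `RULES` are made-up examples covering only four event types, and `classify_relation` is a hypothetical name; a real rule set would be developed against the provided data.

```python
import re
from typing import Optional

# Hypothetical hand-written rules: (pattern over the relation phrase, event label).
# Rules are tried in order; the first match wins.
RULES = [
    (re.compile(r"\b(attack|assault|bomb|shoot|shot)\w*", re.I), "ATTACK"),
    (re.compile(r"\b(marr(y|ied|ies)|wed)\b", re.I), "MARRY"),
    (re.compile(r"\b(arrest|jail|detain)\w*", re.I), "ARREST-JAIL"),
    (re.compile(r"\b(die|dies|died|dying)\b", re.I), "DIE"),
]


def classify_relation(phrase: str) -> Optional[str]:
    """Return an event label for a relation phrase, or None if no rule fires."""
    for pattern, label in RULES:
        if pattern.search(phrase):
            return label
    return None  # not an event under these rules, e.g. "CONTAIN"


if __name__ == "__main__":
    for phrase in ["was arrested by", "assaulted", "CONTAIN"]:
        print(phrase, "->", classify_relation(phrase))
```

Returning `None` for phrases that match no rule is what filters out non-event triples such as ORANGES CONTAIN VITAMIN-C.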
We will provide some training
data; you can also (if you want) use crowdsourcing to create more training
data. Over the next week, we'll be showing how to do this in class.
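If you decide to build the learned classifier yourself rather than download one, a from-scratch multinomial Naive Bayes over the words of the relation phrase is one simple option. The sketch below is only an illustration: the training examples are invented, the label "NONE" for non-events is an assumption, and the real training data may be in a different format.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count labels and per-label words from (relation_phrase, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for phrase, label in examples:
        label_counts[label] += 1
        for word in phrase.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def predict(model, phrase):
    """Return the label with the highest log posterior (add-one smoothing)."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in phrase.lower().split():
            if word in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy data; "NONE" marks relation phrases that are not events.
TOY_DATA = [
    ("was attacked by", "ATTACK"), ("bombed", "ATTACK"), ("opened fire on", "ATTACK"),
    ("married", "MARRY"), ("was married to", "MARRY"),
    ("contain", "NONE"), ("is located in", "NONE"), ("is a source of", "NONE"),
]

if __name__ == "__main__":
    model = train(TOY_DATA)
    for phrase in ["attacked", "was married in", "contain"]:
        print(phrase, "->", predict(model, phrase))
```

With so little data the model is fragile; the point is only to show the shape of training and prediction. Richer features (the arguments, part-of-speech tags, parse information) would slot in where the words are counted.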
Taxonomy of Events
- LIFE
  - BE-BORN
  - MARRY
  - DIVORCE
  - INJURE
  - DIE
- MOVEMENT
  - TRANSPORT
- TRANSACTION
  - TRANSFER-OWNERSHIP
  - TRANSFER-MONEY
- BUSINESS
  - START-ORG
  - MERGE-ORG
  - DECLARE-BANKRUPTCY
  - END-ORG
- CONFLICT
  - ATTACK
  - DEMONSTRATE
- CONTACT
  - MEET
  - PHONE-WRITE
- PERSONNEL
  - START-POSITION
  - END-POSITION
  - NOMINATE
  - ELECT
- JUSTICE
  - ARREST-JAIL
  - RELEASE-PAROLE
  - TRIAL-HEARING
  - CHARGE-INDICT
  - SUE
  - CONVICT
  - SENTENCE
  - FINE
  - EXECUTE
  - EXTRADITE
  - ACQUIT
  - APPEAL
  - PARDON
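In code, the taxonomy can be kept as a small lookup table. The sketch below reads the list as eight top-level types, each followed by its subtypes (this grouping is an assumption about how the list is meant to be read); the names `TAXONOMY`, `SUBTYPE_TO_TYPE`, and `parent_type` are hypothetical.

```python
# Top-level event type -> its subtypes, following the list above.
TAXONOMY = {
    "LIFE": ["BE-BORN", "MARRY", "DIVORCE", "INJURE", "DIE"],
    "MOVEMENT": ["TRANSPORT"],
    "TRANSACTION": ["TRANSFER-OWNERSHIP", "TRANSFER-MONEY"],
    "BUSINESS": ["START-ORG", "MERGE-ORG", "DECLARE-BANKRUPTCY", "END-ORG"],
    "CONFLICT": ["ATTACK", "DEMONSTRATE"],
    "CONTACT": ["MEET", "PHONE-WRITE"],
    "PERSONNEL": ["START-POSITION", "END-POSITION", "NOMINATE", "ELECT"],
    "JUSTICE": [
        "ARREST-JAIL", "RELEASE-PAROLE", "TRIAL-HEARING", "CHARGE-INDICT",
        "SUE", "CONVICT", "SENTENCE", "FINE", "EXECUTE", "EXTRADITE",
        "ACQUIT", "APPEAL", "PARDON",
    ],
}

# Reverse index: subtype -> top-level type.
SUBTYPE_TO_TYPE = {
    subtype: top for top, subtypes in TAXONOMY.items() for subtype in subtypes
}


def parent_type(subtype):
    """Return the top-level type of a subtype, or None if it is not in the table."""
    return SUBTYPE_TO_TYPE.get(subtype.upper())


if __name__ == "__main__":
    print(parent_type("ATTACK"))
    print(len(SUBTYPE_TO_TYPE), "subtypes in", len(TAXONOMY), "types")
```

A table like this makes it easy to start with a subset: restrict the classifier's label set to the subtypes of one or two top-level types and grow from there.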
Data
- Human-readable file including annotations and ReVerb extractions here. How to read this file here.
- Parsed sentence file, including the results of the Stanford parser and ReVerb here. How to read this file here.
Enhancements
Time should permit you to extend the baseline system in one or more
ways as your interest directs. Some ideas include:
- It appears that many events in the training set are actually
described as noun phrases, so if you use only ReVerb, your
recall will be low. One way to find more events (though not their
arguments) is to look for nominals that correspond to
events, i.e. instead of the verb "exploded" look for the noun
"the explosion"; instead
of "acquired" look for "the acquisition". An off-the-shelf
part-of-speech tagger can identify noun phrases, and nominals
would be useful features in a classifier. We can give you a list
of verbs that we've found corresponding to certain event classes,
and you can try to automate their conversion into nominals. This
could be done in two ways: 1)
using WordNet (e.g. its links between derivationally related word forms),
or 2) crowdsourcing.
- Build a Web front end (or app) for the system that allows someone to
paste text from a news story of interest (or paste a URL) and then
runs the extractor and displays the results.
- Extend the system to handle more of the 40 event types (or
create your own event types).
- Open information extraction provides the subject and object
as part of its triples, but these are treated as plain text. For
your first baseline, you can ignore these and just deal with
the event type. As an extension, you can try to improve the
subject and object arguments, or use features of the subject and object
to help you do the classification. Here's one idea: run
the FIGER fine-grained
entity recognition system on the subjects and objects and
determine their types (from a predefined set of 110 types). For
example, if an event phrase is "demolished" then that could mean an
attack. But if the subject is a sports team (recognized by FIGER)
then it's indicative of a sporting event instead.
- Really focus on the crowdsourcing aspect. Measure the quality
of data produced.
- Your ideas here.
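For the crowdsourcing item in the list above, one common way to measure the quality of collected labels is chance-corrected agreement between two annotators (Cohen's kappa). The sketch below is one possible measure, not a required one; the function name and the sample labels are made up for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    worker_1 = ["ATTACK", "ATTACK", "DIE", "NONE"]
    worker_2 = ["ATTACK", "DIE", "DIE", "NONE"]
    print(round(cohens_kappa(worker_1, worker_2), 3))
```

Values near 1 indicate strong agreement and values near 0 indicate agreement no better than chance. With more than two workers per item, majority voting combined with a multi-annotator agreement statistic would be a natural next step.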