Here are some ideas for possible projects. Many of them concern
self-supervised information extraction from Wikipedia, because there
is a growing project in that area. For some background on this
project, see the paper on Kylin.
Although this is a bit outdated, it gives one a sense of the
project. (We'll be reading this soon, probably on 1/23.)
- Active Learning in Information Extraction
Read section 5 in the NSF proposal. How might you best model user
behavior? Do reputation models of Wikipedia users fit in? Which
training example (e.g., a value for an infobox attribute) would be
the best one to ask a user about? Should this be modeled as a
function just of the extractor's precision? Do other factors
(perhaps the reliability of the underlying Wikipedia text) help as
well? Do you think the factors listed on page 11 are correct? Can
you specify the right formula for a utility function?
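The right formula is an open question; as a rough starting point,
here is a minimal sketch of an expected-utility score for ranking
candidate questions. The inputs (per-example extractor confidence,
an article-reliability estimate, a cost of bothering the user) and
the weights are illustrative assumptions, not the answer the
proposal has in mind.

    # Hypothetical sketch: score candidate examples for active learning.
    # 'confidence'  = extractor's confidence in its guess for a value
    # 'reliability' = estimated trustworthiness of the article text
    # 'user_cost'   = how expensive it is to ask this particular user
    def question_utility(confidence, reliability, user_cost,
                         w_uncertainty=1.0, w_reliability=0.5, w_cost=0.3):
        """Higher scores mean the example is more worth asking about."""
        uncertainty = 1.0 - confidence      # uncertain extractions are informative
        return (w_uncertainty * uncertainty
                + w_reliability * reliability   # prefer trustworthy source text
                - w_cost * user_cost)           # penalize expensive interruptions

    # Pick the best candidate from (example_id, confidence, reliability, cost) tuples.
    def best_question(candidates):
        return max(candidates, key=lambda c: question_utility(c[1], c[2], c[3]))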
Alternatively, can one build a good interface that allows a user
(aided by a system such as Kylin) to rapidly improve Wikipedia? For
example, consider something like the UI shown in Figure 8 in the
proposal. This could be done with a Firefox plugin or server-side
using a recent Wikipedia dump.
- Temporal Information Extraction
There has been some work on extracting events from text, but much
remains to be done. Recognizing dates is relatively easy (a regular
expression gives high precision and moderate recall - an OK first
step), but how do you identify the correct corresponding event?
Wikipedia provides a convenient dataset, but the problem is very general.
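As a concrete illustration of the "easy first step", here is a small
regular-expression date recognizer in Python. The patterns are
illustrative only; they catch common formats like "February 2, 2008"
and "02/02/2008" with high precision but miss many variants, which
is exactly the moderate-recall behavior described above. Linking
each recognized date to the right event remains the hard part.

    import re

    # Illustrative date patterns: month-name dates, numeric dates, bare years.
    MONTH = r"(January|February|March|April|May|June|July|August|September|October|November|December)"
    DATE_RE = re.compile(
        r"\b(?:"
        rf"{MONTH}\s+\d{{1,2}},\s*\d{{4}}"      # February 2, 2008
        r"|\d{1,2}/\d{1,2}/\d{4}"               # 02/02/2008
        r"|\d{4}"                               # 1969 (low precision on its own)
        r")\b")

    def find_dates(text):
        """Return all date-like substrings found in the text."""
        return [m.group(0) for m in DATE_RE.finditer(text)]

    print(find_dates("Apollo 11 landed on July 20, 1969, and returned on 07/24/1969."))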
- Extracting Temporal Fluents
The word 'fluent' denotes a relation whose truth changes over
time. For example, George W. Bush is president of the United States,
but this won't always be the case! Furthermore, one can find
assertions on the Web that Clinton is president. Can one distinguish
the case of a fluent from a non-functional relation (i.e., the case
where there can be many presidents of the US at once)? Are the
techniques related to those for extracting events?
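One very rough heuristic, sketched below under the assumption that
each extracted assertion carries a time stamp (say, a year inferred
from the surrounding text): if conflicting values for the same
subject rarely share a time stamp, the relation looks like a fluent;
if they co-occur freely, it looks genuinely multi-valued. The data
format and the threshold are assumptions for illustration only.

    from collections import defaultdict

    def looks_like_fluent(assertions, overlap_threshold=0.1):
        """assertions: (subject, value, year) triples for one relation.
        True if conflicting values rarely share a year for the same subject,
        suggesting the relation changes over time rather than being multi-valued."""
        values_at = defaultdict(set)
        for subj, value, year in assertions:
            values_at[(subj, year)].add(value)
        conflicts = sum(1 for vals in values_at.values() if len(vals) > 1)
        return conflicts / max(len(values_at), 1) < overlap_threshold

    # "president of the US" style data: one value per year -> fluent.
    data = [("USA", "Clinton", 1998), ("USA", "Bush", 2003), ("USA", "Bush", 2007)]
    print(looks_like_fluent(data))   # True under this toy heuristic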
- Rationalize List and category pages
Wikipedia List pages are used to collect similar objects, e.g., "List
of cities" and "List of universities". A list page serves as an
"instance-class" set. However, many list pages in the current
Wikipedia are incomplete, and the schemata they use differ (bulleted
items, tables). Automatically cleaning list pages is a valuable and
challenging problem; it needs document classification and schema
matching techniques. Similar problems exist for category pages.
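A first step might be simply to pull candidate instances out of a
list page's wikitext regardless of whether it uses bulleted items or
a table, as in the toy sketch below; real pages are far messier, so
treat this as a starting point only.

    import re

    def list_page_instances(wikitext):
        """Extract candidate instance names from a Wikipedia list page,
        handling both bulleted items and simple table rows (toy sketch)."""
        instances = []
        for line in wikitext.splitlines():
            line = line.strip()
            if line.startswith("*"):                     # bulleted item: * [[Seattle]]
                instances.extend(re.findall(r"\[\[([^\]|]+)", line))
            elif line.startswith("|") and "[[" in line:  # table row with a wiki link
                links = re.findall(r"\[\[([^\]|]+)", line)
                if links:
                    instances.append(links[0])           # assume first link is the instance
        return instances

    sample = "* [[Seattle]]\n* [[Portland, Oregon|Portland]]\n|-\n| [[Tacoma]] || 200,000"
    print(list_page_instances(sample))   # ['Seattle', 'Portland, Oregon', 'Tacoma']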
- Taxonomy mapping between List, Category, and Infoboxes
Currently, Wikipedia has different taxonomies for organizing
objects/concepts, such as Lists, Categories, and Infoboxes. How to
integrate them into a unified taxonomy is a valuable and challenging
problem.
As one example, many Wikipedia categories are conjunctive (e.g.
"Jewish Physicists" or "Cities in California"). If we wish to build
a useful ontology, we need to factor these and cluster them. Is this
easy?
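Factoring conjunctive categories could start from simple lexical
patterns, as in the toy sketch below: split "X in Y" style names at
the preposition, and otherwise treat the last word as the class and
the rest as a modifier. Real category names (multi-word heads,
nested prepositions) need much more care.

    import re

    def factor_category(name):
        """Split a conjunctive category name into (role, text) pairs (toy heuristic)."""
        # Pattern 1: "Cities in California" -> class "Cities", modifier "California"
        m = re.match(r"^(.+?)\s+(?:in|of|from)\s+(.+)$", name)
        if m:
            return [("class", m.group(1)), ("modifier", m.group(2))]
        # Pattern 2: "Jewish Physicists" -> head noun is the class, the rest modifies it
        words = name.split()
        if len(words) > 1:
            return [("class", words[-1]), ("modifier", " ".join(words[:-1]))]
        return [("class", name)]

    print(factor_category("Cities in California"))  # [('class', 'Cities'), ('modifier', 'California')]
    print(factor_category("Jewish Physicists"))     # [('class', 'Physicists'), ('modifier', 'Jewish')]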
- Schema Constraint deduction
There are more and more user-contributed (semi-)structured data
repositories on the Web. This data has rich information, but usually
has a loose schema with few constraints. For example, Wikipedia's
Infobox system compiles a huge table with over 15 million records. Can
one automatically derive some constraints for the schemata, such as
"person.birthdate < person.deathdate", "person.nickname!=person.name",
and "person.father is a single-valued attribute"? Determining the
data types associated with relations (father of a person is a person)
would also be useful.
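One simple way to go after such constraints, sketched below, is to
propose candidate constraints as predicates and keep only those that
hold on nearly all infobox records having both attributes; the
record format and the 0.99 support threshold are assumptions made
for illustration.

    def constraint_support(records, attr_a, attr_b, predicate):
        """Fraction of records with both attributes for which predicate(a, b) holds."""
        pairs = [(r[attr_a], r[attr_b]) for r in records
                 if attr_a in r and attr_b in r]
        if not pairs:
            return 0.0
        return sum(predicate(a, b) for a, b in pairs) / len(pairs)

    # Toy infobox records (in reality, parsed from a Wikipedia dump).
    people = [
        {"birthdate": 1925, "deathdate": 2013, "name": "X", "nickname": "Y"},
        {"birthdate": 1956, "name": "Z"},
    ]

    # Candidate constraint: person.birthdate < person.deathdate
    support = constraint_support(people, "birthdate", "deathdate", lambda a, b: a < b)
    if support > 0.99:
        print("keep constraint: birthdate < deathdate (support %.2f)" % support)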
- Incremental training for information extraction
As we are studying in class, CRFs are widely used for information
extraction. But the typical training methodology assumes a fixed data
set and one-shot training. This isn't realistic for the Web, where
training data may keep arriving over time (perhaps because of user
interaction as described in the proposal). Can you devise an
incremental training algorithm for CRF learning?
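One possible direction, illustrated below with scikit-learn's
SGDClassifier and its partial_fit method, is to keep updating model
weights as new labeled examples arrive instead of retraining from
scratch. This is only a token-level stand-in with random placeholder
features: a real CRF needs sequence-level gradients, but the
incremental-update pattern would be the same.

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    # Token-level stand-in for a CRF: a linear classifier trained incrementally.
    classes = np.array([0, 1])              # e.g., O vs. infobox-value token
    model = SGDClassifier()

    rng = np.random.default_rng(0)
    for batch in range(5):                  # batches arriving over time (e.g., user corrections)
        X = rng.normal(size=(20, 10))       # 20 tokens, 10 placeholder features each
        y = rng.integers(0, 2, size=20)     # placeholder labels
        model.partial_fit(X, y, classes=classes)  # update weights, no retraining from scratch

    print(model.predict(rng.normal(size=(3, 10))))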
- Information Verification
Sometimes the attribute values extracted by Kylin do not match the
values entered by users. Possible reasons could be:
- Different representations of the same value ("02/02/2008" vs.
"Feb. 02, 2008")
- Either the extraction or the user's input is wrong.
How can we differentiate these two cases? More interestingly, for the
second case, how can we decide which part is wrong, and what the true
value is? One potential approach is to query the broader Web (e.g.,
via Google) and dig out the correct answer.
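For the first case, a normalization step before comparison goes a
long way. The sketch below assumes the values are dates, tries a few
common formats, and compares canonical forms; other data types
(numbers with units, names) would need their own normalizers.

    from datetime import datetime

    DATE_FORMATS = ["%m/%d/%Y", "%b. %d, %Y", "%B %d, %Y", "%Y-%m-%d"]

    def normalize_date(value):
        """Return a canonical ISO date string, or the original value if no format matches."""
        for fmt in DATE_FORMATS:
            try:
                return datetime.strptime(value.strip(), fmt).date().isoformat()
            except ValueError:
                continue
        return value

    def same_value(extracted, user_entered):
        """True if the two strings denote the same value after normalization."""
        return normalize_date(extracted) == normalize_date(user_entered)

    print(same_value("02/02/2008", "Feb. 02, 2008"))   # True: different representations
    print(same_value("02/02/2008", "Feb. 03, 2008"))   # False: one of them is wrong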
- Manage data from both machine and users
It is often the case that conflicts exist between the data contributed
by a machine learner and a user. How should these conflicts be
handled? The obvious approach is to always allow a user to overwrite
the machine-extracted value, but what if the extractor is
high-precision and the user might be a vandal? Perhaps the update
should be verified with another user? Even if the user is presumed
correct, what if the underlying data changes (e.g., a new census comes
out)? Should the machine be allowed to correct a human-entered value
then? Can the system learn a policy based on determining that certain
relations change value over time?
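A policy could start as simple as the rule-based sketch below, which
weighs a (hypothetical) extractor precision against a user reputation
score and falls back to asking another user when neither side clearly
dominates; whether such a policy can be learned rather than
hand-coded is the more interesting question.

    def resolve_conflict(machine_value, user_value, extractor_precision, user_reputation):
        """Decide which value to keep when a machine extraction and a user edit disagree.
        Thresholds and inputs are illustrative assumptions, not part of any real system."""
        if user_reputation > 0.9:                      # trusted editor wins outright
            return user_value
        if extractor_precision > 0.95 and user_reputation < 0.3:
            return machine_value                       # likely vandalism of a reliable extraction
        return ("needs_review", machine_value, user_value)   # ask another user to verify

    print(resolve_conflict("Seattle", "Tacoma", extractor_precision=0.97, user_reputation=0.2))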
- Identify missing information from Wikipedia
Can one leverage information on the greater Web to find missing topics
in Wikipedia? One possible way is to identify important concepts on
the Web and query Wikipedia to check if there is a corresponding
article. Another possible way is to find noun phrases which have no
corresponding articles yet in Wikipedia. A related problem is how to
fill in missing pieces of information for objects already in
Wikipedia. For example, many small cities in Wikipedia have only
short descriptions. How can one find the missing information (like
population) in such cases?
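Checking whether a noun phrase already has an article can be done
against the public MediaWiki API, as in the sketch below (this
assumes network access and the standard action=query endpoint; for a
course project, checking titles against a local dump would be much
faster).

    import requests

    API = "https://en.wikipedia.org/w/api.php"

    def has_article(title):
        """True if an English Wikipedia article with this title exists."""
        params = {"action": "query", "titles": title, "format": "json"}
        pages = requests.get(API, params=params).json()["query"]["pages"]
        # Missing pages are reported under the page id "-1".
        return "-1" not in pages

    for phrase in ["Seattle", "Some unheard-of small town"]:
        print(phrase, has_article(phrase))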