Here are some ideas for possible projects. Many of them concern
self-supervised information extraction from Wikipedia, because there
is a growing project in that area. For some background on this
project, see the paper on Kylin.
Although this is a bit outdated, it gives one a sense of the
project. (We'll be reading this soon, probably on 1/23.)
- Active Learning in Information Extraction
Read section 5 in the NSF proposal. How might you best model user
behavior? Do reputation models of Wikipedia users fit in? Which
training example (e.g., a value for an infobox attribute) would be
the best one to ask a user about? Should this be modeled as a
function just of the extractor's precision? Do other factors
(perhaps the reliability of the underlying Wikipedia text) help as
well? Do you think the factors listed on page 11 are correct? Can
you specify the right formula for a utility function?
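The right formula is an open question; as a rough starting point,
here is a minimal sketch of an expected-utility score for ranking
candidate questions. The inputs (per-example extractor confidence,
an article-reliability estimate, a cost of bothering the user) and
the weights are illustrative assumptions, not the answer the
proposal has in mind.

    # Hypothetical sketch: score candidate examples for active learning.
    # 'confidence'  = extractor's confidence in its guess for a value
    # 'reliability' = estimated trustworthiness of the article text
    # 'user_cost'   = how expensive it is to ask this particular user
    def question_utility(confidence, reliability, user_cost,
                         w_uncertainty=1.0, w_reliability=0.5, w_cost=0.3):
        """Higher scores mean the example is more worth asking about."""
        uncertainty = 1.0 - confidence      # uncertain extractions are informative
        return (w_uncertainty * uncertainty
                + w_reliability * reliability   # prefer trustworthy source text
                - w_cost * user_cost)           # penalize expensive interruptions

    # Pick the best candidate from (example_id, confidence, reliability, cost) tuples.
    def best_question(candidates):
        return max(candidates, key=lambda c: question_utility(c[1], c[2], c[3]))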
Alternatively, can one build a good interface that allows a user
(aided by a system such as Kylin) to rapidly improve Wikipedia? For
example, consider something like the UI shown in Figure 8 in the
proposal. This could be done with a Firefox plugin or server-side
using a recent Wikipedia dump.
- Temporal Information Extraction
There has been some work on extracting events from text, but much
remains to be done. Recognizing dates is relatively easy (a regular
expression gives high precision and moderate recall - an OK first
step), but how do you identify the correct corresponding event?
Wikipedia provides a convenient dataset, but the problem is very general.
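As a concrete illustration of the "easy first step", here is a small
regular-expression date recognizer in Python. The patterns are
illustrative only; they catch common formats like "February 2, 2008"
and "02/02/2008" with high precision but miss many variants, which
is exactly the moderate-recall behavior described above. Linking
each recognized date to the right event remains the hard part.

    import re

    # Illustrative date patterns: month-name dates, numeric dates, bare years.
    MONTH = r"(January|February|March|April|May|June|July|August|September|October|November|December)"
    DATE_RE = re.compile(
        r"\b(?:"
        rf"{MONTH}\s+\d{{1,2}},\s*\d{{4}}"      # February 2, 2008
        r"|\d{1,2}/\d{1,2}/\d{4}"               # 02/02/2008
        r"|\d{4}"                               # 1969 (low precision on its own)
        r")\b")

    def find_dates(text):
        """Return all date-like substrings found in the text."""
        return [m.group(0) for m in DATE_RE.finditer(text)]

    print(find_dates("Apollo 11 landed on July 20, 1969, and returned on 07/24/1969."))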
- Extracting Temporal Fluents
The word 'fluent' denotes a relation whose truth changes over
time. For example, George W. Bush is president of the United States,
but this won't always be the case! Furthermore, one can find
assertions on the Web that Clinton is president. Can one distinguish
the case of a fluent from a non-functional relation (i.e., the case
where there can be many presidents of the US at once)? Are the
techniques related to those for extracting events?
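One very rough heuristic, sketched below under the assumption that
each extracted assertion carries a time stamp (say, a year inferred
from the surrounding text): if conflicting values for the same
subject rarely share a time stamp, the relation looks like a fluent;
if they co-occur freely, it looks genuinely multi-valued. The data
format and the threshold are assumptions for illustration only.

    from collections import defaultdict

    def looks_like_fluent(assertions, overlap_threshold=0.1):
        """assertions: (subject, value, year) triples for one relation.
        True if conflicting values rarely share a year for the same subject,
        suggesting the relation changes over time rather than being multi-valued."""
        values_at = defaultdict(set)
        for subj, value, year in assertions:
            values_at[(subj, year)].add(value)
        conflicts = sum(1 for vals in values_at.values() if len(vals) > 1)
        return conflicts / max(len(values_at), 1) < overlap_threshold

    # "president of the US" style data: one value per year -> fluent.
    data = [("USA", "Clinton", 1998), ("USA", "Bush", 2003), ("USA", "Bush", 2007)]
    print(looks_like_fluent(data))   # True under this toy heuristic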
- Rationalize List and category pages
Wikipedia List pages are used to collect similar objects, e.g., "List
of cities" and "List of universities". A list page serves as an
"instance-class" set. However, many list pages in the current
Wikipedia are incomplete, and the schemata they use differ (bulleted
items, tables). Automatically cleaning list pages is a valuable and
challenging problem; it needs document classification and schema
matching techniques. Similar problems exist for category pages.
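A first step might be simply to pull candidate instances out of a
list page's wikitext regardless of whether it uses bulleted items or
a table, as in the toy sketch below; real pages are far messier, so
treat this as a starting point only.

    import re

    def list_page_instances(wikitext):
        """Extract candidate instance names from a Wikipedia list page,
        handling both bulleted items and simple table rows (toy sketch)."""
        instances = []
        for line in wikitext.splitlines():
            line = line.strip()
            if line.startswith("*"):                     # bulleted item: * [[Seattle]]
                instances.extend(re.findall(r"\[\[([^\]|]+)", line))
            elif line.startswith("|") and "[[" in line:  # table row with a wiki link
                links = re.findall(r"\[\[([^\]|]+)", line)
                if links:
                    instances.append(links[0])           # assume first link is the instance
        return instances

    sample = "* [[Seattle]]\n* [[Portland, Oregon|Portland]]\n|-\n| [[Tacoma]] || 200,000"
    print(list_page_instances(sample))   # ['Seattle', 'Portland, Oregon', 'Tacoma']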
- Taxonomy mapping between List, Category, and Infoboxes
Currently, Wikipedia has different taxonomies for organizing
objects/concepts, such as Lists, Categories, and Infoboxes. How to
integrate them into a unified taxonomy is a valuable and challenging
problem.
As one example, many Wikipedia categories are conjunctive (e.g.
"Jewish Physicists" or "Cities in California"). If we wish to build
a useful ontology, we need to factor these and cluster them. Is this
easy?
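Factoring conjunctive categories could start from simple lexical
patterns, as in the toy sketch below: split "X in Y" style names at
the preposition, and otherwise treat the last word as the class and
the rest as a modifier. Real category names (multi-word heads,
nested prepositions) need much more care.

    import re

    def factor_category(name):
        """Split a conjunctive category name into (role, text) pairs (toy heuristic)."""
        # Pattern 1: "Cities in California" -> class "Cities", modifier "California"
        m = re.match(r"^(.+?)\s+(?:in|of|from)\s+(.+)$", name)
        if m:
            return [("class", m.group(1)), ("modifier", m.group(2))]
        # Pattern 2: "Jewish Physicists" -> head noun is the class, the rest modifies it
        words = name.split()
        if len(words) > 1:
            return [("class", words[-1]), ("modifier", " ".join(words[:-1]))]
        return [("class", name)]

    print(factor_category("Cities in California"))  # [('class', 'Cities'), ('modifier', 'California')]
    print(factor_category("Jewish Physicists"))     # [('class', 'Physicists'), ('modifier', 'Jewish')]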
- Schema Constraint deduction
There are more and more user-contributed (semi-)structured data
repositories on the Web. This data has rich information, but usually
has a loose schema with few constraints. For example, Wikipedia's
Infobox system compiles a huge table with over 15 million records. Can
one automatically derive some constraints for the schemata, such as
"person.birthdate < person.deathdate", "person.nickname!=person.name",
and "person.father is a single-valued attribute"? Determining the
data types associated with relations (father of a person is a person)
would also be useful.
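One simple way to go after such constraints, sketched below, is to
propose candidate constraints as predicates and keep only those that
hold on nearly all infobox records having both attributes; the
record format and the 0.99 support threshold are assumptions made
for illustration.

    def constraint_support(records, attr_a, attr_b, predicate):
        """Fraction of records with both attributes for which predicate(a, b) holds."""
        pairs = [(r[attr_a], r[attr_b]) for r in records
                 if attr_a in r and attr_b in r]
        if not pairs:
            return 0.0
        return sum(predicate(a, b) for a, b in pairs) / len(pairs)

    # Toy infobox records (in reality, parsed from a Wikipedia dump).
    people = [
        {"birthdate": 1925, "deathdate": 2013, "name": "X", "nickname": "Y"},
        {"birthdate": 1956, "name": "Z"},
    ]

    # Candidate constraint: person.birthdate < person.deathdate
    support = constraint_support(people, "birthdate", "deathdate", lambda a, b: a < b)
    if support > 0.99:
        print("keep constraint: birthdate < deathdate (support %.2f)" % support)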
- Incremental training for information extraction
As we are studying in class, CRFs are widely used for information
extraction. But the typical training methodology assumes a fixed data
set and one-shot training. This isn't realistic for the Web, where
training data may keep arriving over time (perhaps because of user
interaction as described in the proposal). Can you devise an
incremental training algorithm for CRF learning?
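One possible direction, illustrated below with scikit-learn's
SGDClassifier and its partial_fit method, is to keep updating model
weights as new labeled examples arrive instead of retraining from
scratch. This is only a token-level stand-in with random placeholder
features: a real CRF needs sequence-level gradients, but the
incremental-update pattern would be the same.

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    # Token-level stand-in for a CRF: a linear classifier trained incrementally.
    classes = np.array([0, 1])              # e.g., O vs. infobox-value token
    model = SGDClassifier()

    rng = np.random.default_rng(0)
    for batch in range(5):                  # batches arriving over time (e.g., user corrections)
        X = rng.normal(size=(20, 10))       # 20 tokens, 10 placeholder features each
        y = rng.integers(0, 2, size=20)     # placeholder labels
        model.partial_fit(X, y, classes=classes)  # update weights, no retraining from scratch

    print(model.predict(rng.normal(size=(3, 10))))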
- Information Verification
Sometimes the attribute values extracted by Kylin do not match the
values entered by users. Possible reasons could be:
- Different representations of the same value ("02/02/2008" vs.
"Feb. 02, 2008")
- Either the extraction or the user's input is wrong.
How can we differentiate these two cases? More interestingly, for the
second case, how can we decide which part is wrong, and what the true
value is? One potential approach is to query the broader Web (e.g.,
via Google) and dig out the correct answer.
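For the first case, a normalization step before comparison goes a
long way. The sketch below assumes the values are dates, tries a few
common formats, and compares canonical forms; other data types
(numbers with units, names) would need their own normalizers.

    from datetime import datetime

    DATE_FORMATS = ["%m/%d/%Y", "%b. %d, %Y", "%B %d, %Y", "%Y-%m-%d"]

    def normalize_date(value):
        """Return a canonical ISO date string, or the original value if no format matches."""
        for fmt in DATE_FORMATS:
            try:
                return datetime.strptime(value.strip(), fmt).date().isoformat()
            except ValueError:
                continue
        return value

    def same_value(extracted, user_entered):
        """True if the two strings denote the same value after normalization."""
        return normalize_date(extracted) == normalize_date(user_entered)

    print(same_value("02/02/2008", "Feb. 02, 2008"))   # True: different representations
    print(same_value("02/02/2008", "Feb. 03, 2008"))   # False: one of them is wrong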
- Manage data from both machine and users
It is often the case that conflicts exist between the data contributed
by a machine learner and a user. How should these conflicts be
handled? The obvious approach is to always allow a user to overwrite
the machine-extracted value, but what if the extractor is
high-precision and the user might be a vandal? Perhaps the update
should be verified with another user? Even if the user is presumed
correct, what if the underlying data changes (e.g., a new census comes
out)? Should the machine be allowed to correct a human-entered value
then? Can the system learn a policy based on determining that certain
relations change value over time?
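A policy could start as simple as the rule-based sketch below, which
weighs a (hypothetical) extractor precision against a user reputation
score and falls back to asking another user when neither side clearly
dominates; whether such a policy can be learned rather than
hand-coded is the more interesting question.

    def resolve_conflict(machine_value, user_value, extractor_precision, user_reputation):
        """Decide which value to keep when a machine extraction and a user edit disagree.
        Thresholds and inputs are illustrative assumptions, not part of any real system."""
        if user_reputation > 0.9:                      # trusted editor wins outright
            return user_value
        if extractor_precision > 0.95 and user_reputation < 0.3:
            return machine_value                       # likely vandalism of a reliable extraction
        return ("needs_review", machine_value, user_value)   # ask another user to verify

    print(resolve_conflict("Seattle", "Tacoma", extractor_precision=0.97, user_reputation=0.2))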
- Identify missing information from Wikipedia
Can one leverage information on the greater Web to find missing topics
in Wikipedia? One possible way is to identify important concepts on
the Web and query Wikipedia to check if there is a corresponding
article. Another possible way is to find noun phrases which have no
corresponding articles yet in Wikipedia. A related problem is how to
fill in missing pieces of information for objects already in
Wikipedia. For example, many small cities in Wikipedia have only
short descriptions. How can one find the missing information (like
population) in such cases?
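Checking whether a noun phrase already has an article can be done
against the public MediaWiki API, as in the sketch below (this
assumes network access and the standard action=query endpoint; for a
course project, checking titles against a local dump would be much
faster).

    import requests

    API = "https://en.wikipedia.org/w/api.php"

    def has_article(title):
        """True if an English Wikipedia article with this title exists."""
        params = {"action": "query", "titles": title, "format": "json"}
        pages = requests.get(API, params=params).json()["query"]["pages"]
        # Missing pages are reported under the page id "-1".
        return "-1" not in pages

    for phrase in ["Seattle", "Some unheard-of small town"]:
        print(phrase, has_article(phrase))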