CSE574 Project Proposal

Topic

 

Extracted Information Verification

 

Member

 

Peng Dai

 

Objective

 

One annoying problem with automatic information extraction software, such as KYLIN, is that the attribute-value pairs it extracts are sometimes inconsistent with the values typed in by humans. There are two explanations for this. The first is that one side is incorrect: either the human-provided value contains an error, or the extraction software failed to extract the correct value. The second is that both are correct but expressed in different formats. For example, there are different systems of measurement, so 100 km/h can be considered equivalent to about 62 mph. The goal of this project is to improve the performance of KYLIN by means of information verification.
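The format mismatch case can be illustrated with a small normalization step. The following is a minimal sketch, assuming the values are speeds written in km/h or mph; the function names and the 5% tolerance are hypothetical choices for illustration and are not part of KYLIN.

```python
import re

# Conversion factors to a common base unit (km/h). One mile is 1.609344 km.
TO_KMH = {"km/h": 1.0, "mph": 1.609344}

def parse_speed(text):
    """Parse a string such as '100km/h' or '62 mph' into a value in km/h."""
    match = re.fullmatch(r"\s*([0-9]+(?:\.[0-9]+)?)\s*(km/h|mph)\s*", text)
    if match is None:
        raise ValueError(f"unrecognized speed: {text!r}")
    value, unit = float(match.group(1)), match.group(2)
    return value * TO_KMH[unit]

def same_speed(a, b, rel_tol=0.05):
    """Treat two speed strings as consistent if they agree within rel_tol."""
    x, y = parse_speed(a), parse_speed(b)
    return abs(x - y) <= rel_tol * max(x, y)
```

With such a check, "100km/h" and "62 mph" would be treated as the same value, while "100km/h" and "100mph" would still be flagged as inconsistent.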

 

Plan

 

REALM is an unsupervised extraction verification system. Its input is a set of extraction tuples. REALM assesses the correctness of each extraction using machine learning techniques and outputs the same set of tuples, sorted by its confidence in their correctness. Our first method of information verification is to pipeline KYLIN and REALM, on the expectation that combining the two IE systems will give better results than using KYLIN alone.
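The pipeline idea can be sketched as follows. This is only an illustration under stated assumptions: neither system's real interface is used here, the tuple layout is assumed, `score` is a hypothetical stand-in for whatever confidence the verifier assigns, and the thresholding step is one possible way to refine the output rather than something the systems are known to do.

```python
from typing import Callable, List, Tuple

# Assumed tuple layout for illustration: (entity, attribute, value).
Extraction = Tuple[str, str, str]

def rank_extractions(
    extractions: List[Extraction],
    score: Callable[[Extraction], float],
) -> List[Extraction]:
    """Return the same tuples, sorted from most to least confident."""
    return sorted(extractions, key=score, reverse=True)

def filter_by_confidence(
    extractions: List[Extraction],
    score: Callable[[Extraction], float],
    threshold: float,
) -> List[Extraction]:
    """Keep only tuples whose confidence reaches the (hypothetical) threshold."""
    return [t for t in extractions if score(t) >= threshold]
```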

 

The Web is a natural repository, so we can mine it for answers. Our second option is to generate queries by combining the feature name with each of the inconsistent feature values, and to submit these queries to powerful search engines such as Google or Yahoo. We then compute the mutual information between the feature and each value to judge which value is correct.
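One common way to estimate such a score from search results is pointwise mutual information (PMI) over hit counts. The sketch below assumes that choice; the proposal does not fix an exact formula. The hit counts and the total page count are assumed to be supplied by the caller, and no real search engine API is called.

```python
import math

def pmi(hits_joint, hits_feature, hits_value, total_pages):
    """PMI of a feature and a value, estimated from hit counts.

    PMI = log( P(feature, value) / (P(feature) * P(value)) ),
    where each probability is a hit count divided by total_pages.
    """
    if min(hits_joint, hits_feature, hits_value) <= 0:
        return float("-inf")  # no evidence that the two co-occur
    p_joint = hits_joint / total_pages
    p_feature = hits_feature / total_pages
    p_value = hits_value / total_pages
    return math.log(p_joint / (p_feature * p_value))

def pick_value(hits_feature, candidates, total_pages):
    """Pick the candidate value with the highest PMI.

    candidates maps each value to a pair:
    (hits for the value alone, hits for feature and value together).
    """
    return max(
        candidates,
        key=lambda v: pmi(candidates[v][1], hits_feature, candidates[v][0], total_pages),
    )
```

Normalizing by the individual hit counts keeps a very common value from winning merely because it appears on many pages.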

 

In our experiments, we will compare the precision and recall of the new system against those of KYLIN. We plan to use several categories in our experiments. We are interested in answering the following questions:

1) Can KYLIN benefit from integration with REALM?

2) What are the precision and recall of the combined system?

3) If the answer to 1) is yes, in which category does KYLIN benefit the most? If the answer to 1) is no, what is the reason?
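For reference, precision and recall over a set of extracted tuples can be computed as in the following sketch. It assumes that a gold-standard set of correct tuples is available for each category; the function name is illustrative.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) of predicted tuples against gold tuples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: fraction of extracted tuples that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of correct tuples that were extracted.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

Running it once on the KYLIN output and once on the output of the combined system, per category, gives the figures needed for the comparison.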

 

Milestone 1: Get familiar with KYLIN and REALM; pipeline the two systems

Milestone 2: Evaluate the performance of the new system, learn tools for generating search engine queries, research mutual information

Final: Experimentation, writing

Optional: Perform statistical analysis of the query results