CSE 574 Project

Active Learning in Wikipedia

Aniruddh Nath

The goal of this project is to extend the KYLIN system's infobox generator by allowing it to pose questions to human users. One use of this is to get human feedback on low-confidence training examples, which would improve the accuracy of the sentence classifier and the CRF extractor.
Another possibility is to have humans directly specify attribute values. This information could also be incorporated into the training data, improving future extractions.
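As a rough illustration of the first idea, the sketch below picks out low-confidence extractions to route to human users. This is not KYLIN's actual interface; the Extraction record and the 0.7 threshold are assumptions made purely for illustration.

    # Hypothetical sketch: flag extractions whose confidence falls below a
    # threshold so they can be shown to human users for verification.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Extraction:
        page: str          # Wikipedia article the value was extracted from
        attribute: str     # infobox attribute name, e.g. "birth_date"
        value: str         # extracted value
        confidence: float  # extractor confidence in [0, 1]

    def low_confidence(extractions: List[Extraction],
                       threshold: float = 0.7) -> List[Extraction]:
        """Return the extractions uncertain enough to be worth asking about."""
        return [e for e in extractions if e.confidence < threshold]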
The central question is which queries to pose to which user. Many factors should influence this decision, probably including the following (a rough sketch of how they might be combined appears after the list):
  • The confidence in the prior belief.
  • The 'usefulness' of the piece of data. For instance, common infobox attributes on popular pages are likely to be useful.
  • The edit history of the user.
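One hypothetical way to combine these factors into a single score is a weighted sum, as sketched below. The weights, the normalized page-popularity feature, and the edit-count cap are all assumptions, not part of the proposed system.

    # Hypothetical utility of asking a particular user a particular query.
    # Weights and feature definitions are illustrative assumptions.
    def query_utility(confidence: float,       # prior belief in the current value, in [0, 1]
                      page_popularity: float,  # e.g. normalized page-view count, in [0, 1]
                      user_edit_count: int     # number of edits by the candidate user
                      ) -> float:
        uncertainty = 1.0 - confidence                    # less confidence -> more to gain
        usefulness = page_popularity                      # popular pages reach more readers
        reliability = min(user_edit_count / 100.0, 1.0)   # crude proxy for answer quality
        return 0.5 * uncertainty + 0.3 * usefulness + 0.2 * reliability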

Tentative Schedule

Milestone 1 (Feb 13)
The first thing to try is a simple heuristic approach to question selection. This system will probably pick questions independently, assuming that the answer to one query does not affect the utility of the others. I'll try to fit this into the KYLIN system and see how well this approach does.
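A minimal sketch of what picking questions independently could look like, assuming the query_utility function sketched earlier: score each candidate query on its own and take the top k. The candidate tuple format and the value of k are assumptions.

    import heapq

    def select_queries(candidates, k=10):
        """Pick the k highest-utility queries, ignoring interactions between them.

        Each candidate is assumed to be a (query, confidence, page_popularity,
        user_edit_count) tuple; under the independence assumption, a query's
        score never depends on which other queries are selected.
        """
        scored = [(query_utility(c, p, e), q) for (q, c, p, e) in candidates]
        return [q for _, q in heapq.nlargest(k, scored, key=lambda t: t[0])]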
Milestone 2 (Feb 27)
Depending on how well Milestone 1 performs, it might be worth trying a more elaborate approach to question selection. For example, a decision-theoretic extension to the Alchemy system might be able to handle dependencies between queries better than the heuristic approach.
If, on the other hand, the heuristic approach seems sufficient, I could extend the utility function to consider more factors and see how they affect performance.

Evaluation

The KYLIN experiments will provide a starting point for the system's evaluation. It would be interesting to see how precision and recall change as a function of the number of user queries answered.
The change in the utility value is also probably worth tracking, but that does not by itself tell us whether we chose a good utility function.
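As a concrete (assumed) form for the precision/recall measurement mentioned above, the sketch below compares extracted (page, attribute, value) triples against a hand-labeled gold set; the triple representation is an assumption.

    def precision_recall(predicted: set, gold: set):
        """Precision and recall over (page, attribute, value) triples.

        `predicted` holds the extractor's output, `gold` a hand-labeled reference
        set; re-running this after each batch of answered queries shows how the
        curves move as human feedback accumulates.
        """
        if not predicted or not gold:
            return 0.0, 0.0
        true_positives = len(predicted & gold)
        return true_positives / len(predicted), true_positives / len(gold)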