CSE 574 Project
Active Learning in Wikipedia
Aniruddh Nath
The goal of this project is to extend the KYLIN system's infobox
generator by allowing it to pose questions to human users. Human
feedback on low-confidence training examples could then be used to
improve the accuracy of the sentence classifier and the CRF
extractor.
Another possibility is to have humans directly specify attribute
values. This information could also be incorporated into the
training data, improving future extractions.
The central question is which queries to pose to which user. Many
factors should affect this decision, probably including:
- The confidence in the prior belief.
- The 'usefulness' of the piece of data. For instance, common infobox attributes on popular pages are likely to be useful.
- The edit history of the user.
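As a rough illustration, the factors above could be folded into a single utility score. The weights, thresholds, and scoring functions below are all assumptions for the sketch, not part of KYLIN:

```python
# Hypothetical heuristic combining the three factors above into one
# utility score for a candidate query. All weights and normalization
# constants are assumptions.

def query_utility(confidence, page_views, attribute_frequency, user_edits,
                  w_conf=1.0, w_use=0.5, w_user=0.25):
    # Low-confidence extractions benefit most from human feedback;
    # this term peaks when the extractor is maximally uncertain (p = 0.5).
    uncertainty = 1.0 - abs(confidence - 0.5) * 2.0

    # 'Usefulness': common infobox attributes on popular pages.
    usefulness = attribute_frequency * min(page_views / 10_000.0, 1.0)

    # Weight queries to experienced editors more heavily.
    user_weight = min(user_edits / 100.0, 1.0)

    return w_conf * uncertainty + w_use * usefulness + w_user * user_weight
```

Under this scoring, a maximally uncertain extraction on a popular page shown to an experienced editor scores highest, which matches the intent of the factor list.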
Tentative Schedule
Milestone 1 (Feb 13)
The first thing to try is a simple heuristic approach to question
selection. This system will probably pick questions independently,
assuming that the answer to one query does not affect the utility
of other queries. I'll try to fit this into the KYLIN system and
see how well this approach does.
Milestone 2 (Feb 27)
Depending on how well milestone 1 performs, it might be worth trying
a more elaborate approach to question selection. For example, a
decision-theoretic extension to the Alchemy system might
be able to handle dependencies between queries better than the
heuristic approach.
If, on the other hand, the heuristic approach seems to be
sufficient, I could make the utility function consider more factors
and see how they affect performance.
Evaluation
The KYLIN experiments will provide a starting point for the system's
evaluation. It would be interesting to see how precision and recall
change after a certain number of user queries.
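One way to track this is to score the extracted (attribute, value) pairs against a gold set after each batch of queries. A minimal sketch, assuming both sets are available as pairs:

```python
def precision_recall(extracted, gold):
    """Precision and recall of extracted (attribute, value) pairs
    against a gold set. Re-measuring after each batch of user
    queries shows how the answers shift both metrics."""
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)  # correctly extracted pairs
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall
```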
The change in the utility value is also probably worth looking at,
though it does not in itself tell us whether we chose a good
utility function.