From: Brian Ferris (bdferris@cs.washington.edu)
Date: Wed Dec 08 2004 - 11:48:30 PST
Turney, Peter D. "Mining the Web for Synonyms: PMI-IR versus LSA on
TOEFL."
The paper presents an unsupervised learning algorithm for recognizing
synonyms. The algorithm, PMI-IR, combines Pointwise Mutual Information
(PMI) with Information Retrieval (IR) and is trained on statistical
data pulled from web-search query results.
The most important idea in this paper is the use of web search as an
unsupervised data source for the PMI-IR algorithm. The author notes in
the paper that previous work by Landauer and Dumais suggested that PMI
performed poorly in a similar experimental comparison because PMI was
trained on a much smaller data set and PMI is susceptible to sparse
data. By coupling PMI-IR with a web search, the algorithm is exposed
to a very dense data set of English language documents, allowing it to
perform dramatically better.
Another important idea in this paper is the use of context in
conditioning the search results. While simple co-occurrence using the
'AND' operator in search queries yielded a reasonable number of correct
answers (62.5%), the addition of the 'NEAR' operator boosted
correct answers to ~73% on the TOEFL experiments. These results
capture the insight that two words share more mutual information if
they appear closer to each other. The addition of a context word in
the ESL experiments yielded a similar increase in performance.
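
As a rough illustration of the scoring idea described above, here is a
minimal sketch in Python. It assumes each candidate is scored by the
ratio hits(problem OP choice) / hits(choice), where OP is 'AND' or
'NEAR'; the hit count for the problem word alone is the same for every
choice, so it can be dropped without changing the ranking. The function
names and all hit counts below are hypothetical and made up for the
example; they are not taken from the paper or from any search engine.

```python
def pmi_ir_score(hits_joint: int, hits_choice: int) -> float:
    """Ratio hits(problem OP choice) / hits(choice).

    Returns 0.0 when the choice word has no hits at all, to avoid
    dividing by zero (an assumption of this sketch).
    """
    if hits_choice == 0:
        return 0.0
    return hits_joint / hits_choice


def best_choice(joint_hits: dict[str, int], choice_hits: dict[str, int]) -> str:
    """Pick the candidate with the highest co-occurrence ratio."""
    return max(
        joint_hits,
        key=lambda c: pmi_ir_score(joint_hits[c], choice_hits[c]),
    )


# Hypothetical hit counts for the problem word "levied" (illustrative only).
choice_hits = {
    "imposed": 40_000,
    "believed": 900_000,
    "requested": 700_000,
    "correlated": 60_000,
}
# Hypothetical counts for the query: levied NEAR <choice>
near_hits = {
    "imposed": 120,
    "believed": 30,
    "requested": 45,
    "correlated": 5,
}

answer = best_choice(near_hits, choice_hits)
print(answer)
```

Switching from 'AND' to 'NEAR' only changes which joint hit counts are
fed in; the ratio itself stays the same in this sketch.
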
There were some flaws in the paper. I would like to see some
clarification of the 'corrected for guessing' detail mentioned at the
end of section four in the discussion of the Landauer and Dumais paper.
When discussing their results, the author mentions that most of their
test scores went down after correcting for guessing. It is not clear
whether a similar correction was applied to the author's own results,
or why he mentioned it at all. A larger issue is with the discussion of
PMI-IR versus LSA. While the results involving just PMI-IR are
compelling, the comparison between the two algorithms would have been
stronger if the author had followed through on some of his ideas for
future work. Specifically, he could have adjusted either the chunk size
or the training data so that PMI-IR and LSA started on more even ground
for comparison.
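
To make the 'corrected for guessing' question above concrete, here is a
small sketch. It assumes the standard correction for multiple-choice
tests (the fraction right minus the fraction wrong divided by k - 1,
for k answer choices) and assumes four choices per TOEFL question. The
paper may or may not use exactly this formula, so the numbers are only
an illustration of how much such a correction could move a raw score.

```python
def corrected_for_guessing(raw_fraction: float, num_choices: int) -> float:
    """Standard guessing correction: right - wrong / (k - 1).

    raw_fraction is the fraction of questions answered correctly,
    and every unanswered-correctly question is counted as wrong.
    """
    wrong_fraction = 1.0 - raw_fraction
    return raw_fraction - wrong_fraction / (num_choices - 1)


# The 62.5% raw score quoted above for the 'AND' query, assuming
# four answer choices per question.
and_corrected = corrected_for_guessing(0.625, 4)
print(f"{and_corrected:.3f}")
```

Under these assumptions a score at chance level (25% with four choices)
corrects to zero, which is why raw and corrected scores are not
directly comparable.
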
Immediate areas for future work include the expanded comparison of
PMI-IR and LSA suggested by the author. I think further work is also
needed in exploring different heuristics for the mutual information
between two terms. As evidenced by the performance increase from adding
the 'NEAR' operator, more advanced techniques for determining
co-occurrence should be explored. The author's intuition that proximity
suggests synonymy is reasonable, but it is only one such heuristic.