Review

From: Ankur Jain (ankur@cs.washington.edu)
Date: Wed Dec 08 2004 - 12:59:02 PST


Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL by Peter Turney

* One-line summary

The paper presents a simple unsupervised algorithm for recognizing synonyms.
The key idea is to use a search engine to retrieve from the web a large
corpus of documents containing the words/contexts being tested, and to run
pointwise mutual information (PMI) over them for co-occurrence analysis.

* The (two) most important ideas in the paper, and why

The main contribution of the paper is to advocate the combined use of
IR and statistical approaches for unsupervised learning. While the latter
seems to be very common, leveraging IR to get a huge collection of documents
to do this analysis on is what helps achieve better results.
Using the web as a source of documents and exploiting search engines to do
IR was, I thought, the other main idea in the paper.
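
As a rough illustration of the statistical half, here is a minimal Python
sketch of PMI computed from document hit counts. The function names and all
of the counts are made up for illustration; they are not taken from the
paper, and the search-engine queries that would supply real counts are left
out.

```python
import math

def pmi(joint_hits, problem_hits, choice_hits, total_docs):
    """Pointwise mutual information, log2(p(a,b) / (p(a) * p(b))),
    estimated from document counts."""
    if joint_hits == 0 or problem_hits == 0 or choice_hits == 0:
        return float("-inf")  # no evidence of co-occurrence
    p_joint = joint_hits / total_docs
    p_problem = problem_hits / total_docs
    p_choice = choice_hits / total_docs
    return math.log2(p_joint / (p_problem * p_choice))

def best_choice(problem_hits, choices, total_docs):
    """choices maps each candidate word to (joint_hits, choice_hits).
    Returns the candidate with the highest PMI against the problem word."""
    return max(
        choices,
        key=lambda w: pmi(choices[w][0], problem_hits, choices[w][1], total_docs),
    )

# Hypothetical counts: candidate -> (docs containing both words, docs with candidate)
counts = {"big": (800, 20000), "green": (50, 15000), "fast": (60, 12000)}
answer = best_choice(problem_hits=10000, choices=counts, total_docs=1_000_000)
print(answer)
```

Since the problem word's count and the collection size are the same for
every candidate, the ranking depends only on the ratio of joint hits to
candidate hits, so only those two counts per candidate actually matter.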

* The one or two largest flaws in the paper

The author acknowledges this himself, but I thought that the comparison
was very unfair. He mentions a few factors, such as the hugely dissimilar
amounts of documents that LSA and PMI-IR mine.

One crucial factor that the author somehow ignored (especially
in his running time analysis) is that although the search engine returns
results in just a couple of seconds, that is due to the effort of hundreds
of computers simultaneously doing number crunching for those couple of
seconds. Moreover, search engines like AltaVista presumably do IR
over suitably processed documents with co-occurrence/correlation analysis
probably already done -- which makes the comparison even more unfair.

Finally, I don't know how different text mining is from data mining --
but if it is not very different, then I don't see the big contribution of
the paper.

* Identify two important, open research questions on the topic,
and why they matter

On a more myopic level, the one thing that I would really like to see is a
well-fleshed-out analysis: for instance, run PMI on the same set of
documents as LSA to better understand the effect of the larger data source.
Even if the message that the author wants to get across is that it is not
just PMI but its combination with IR that matters, he should still have
given results of PMI on a smaller dataset as a baseline to compare these
results with.

Statistical techniques for analysis suffer from sparse data
problems -- and the most obvious way around this is to use a large data set
to work on -- and what better source than the web itself. An interesting
direction would be to step back and rethink how the tremendous power of
the gargantuan search engines can be exploited to get hold of datasets that
were impossible to get hold of a couple of years ago.


