Review of "Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL - P. D. Turney" Harsha V. Madhyastha This paper presents a unsupervised learning algorithm for recognizing synonyms, which makes use of results provided by a Web search engine. The algorithm, as presented in this paper, is mainly intended to solve questions in tests such as TOEFL where one has to choose the synonym for a word given a set of choices. The algorithm considers each choice one at a time and issues Web requests to determine the probability of co-occurrence of the problem word and the choice under consideration. There are four different techniques that the author uses to measure co-occurrence, which is used as a metric for synonymity. Evaluation on sample questions from TOEFL and ESL show that this algorithm performs better than LSA and also does better than the average score of a human from a non-English speaking country. The main idea in this paper is clearly that it shows the worth of tapping the huge amount of data indexed by Web search engines. Unlike in LSA, the overhead involved is extremely low as all that the proposed algorithm needs to do is issue a few queries to a search engine and assimilate the results returned. Also, since the number of documents indexed by a search engine is orders of magnitude larger than any typical corpus, we no more have to worry about any bias introduced by the set of documents considered. The other major point that I feel the paper highlights is the simplicity of the problem of detecting synonyms. It shows that complex techniques such as LSA are unnecessary and similar performance can be obtained even by naive use of the Web search engine's results. One point that the paper conveniently glossed over is that the proposed algorithm is so simple and low-overhead primarily because of the format in which the synonymity problem is considered. In the alternative problem setting that the author plans to consider in future work - detect the keywords from a given document - the problem is going to be much harder as neither is the problem word nor the choices for its synonym well-defined. In this case, one cannot get away by just issuing a few Web requests, making the overhead of issuing queries significantly higher. I also feel that the author's comparison of the running time of SVD with the time taken to issue a few Web requests is completely unfair. The step of running SVD has to be taken just once in the execution of LSA and should rather be compared with the time taken by the search engine to build up its index. It's just that we don't have to bear the overhead of building the index in the algorithm considered. In LSA too, given a corpus of documents, SVD has to be executed just once and then synonym detection can be done just like how the proposed algorithm does. Though the proposed algorithm is a novel approach, I believe its improvement over LSA is stretched too far. If the author does pursue his intended future work of keyword detection in a document, that will certainly be a really cool application to have. It would also be interesting to see if this technique can be applied in solving crosswords. That problem setting would be unlike both the setting considered in this paper (problem word given with choices for synonyms) as well as that in keyword detection (pairs of words which are synonyms of each other need to be detected). In the case of the crossword, the problem word is known but the choices for its synonym cannot be picked from anywhere.