Review of "Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL - P. D. Turney" Harsha V. Madhyastha This paper presents a unsupervised learning algorithm for recognizing synonyms, which makes use of results provided by a Web search engine. The algorithm, as presented in this paper, is mainly intended to solve questions in tests such as TOEFL where one has to choose the synonym for a word given a set of choices. The algorithm considers each choice one at a time and issues Web requests to determine the probability of co-occurrence of the problem word and the choice under consideration. There are four different techniques that the author uses to measure co-occurrence, which is used as a metric for synonymity. Evaluation on sample questions from TOEFL and ESL show that this algorithm performs better than LSA and also does better than the average score of a human from a non-English speaking country. The main idea in this paper is clearly that it shows the worth of tapping the huge amount of data indexed by Web search engines. Unlike in LSA, the overhead involved is extremely low as all that the proposed algorithm needs to do is issue a few queries to a search engine and assimilate the results returned. Also, since the number of documents indexed by a search engine is orders of magnitude larger than any typical corpus, we no more have to worry about any bias introduced by the set of documents considered. The other major point that I feel the paper highlights is the simplicity of the problem of detecting synonyms. It shows that complex techniques such as LSA are unnecessary and similar performance can be obtained even by naive use of the Web search engine's results. One point that the paper conveniently glossed over is that the proposed algorithm is so simple and low-overhead primarily because of the format in which the synonymity problem is considered. In the alternative problem setting that the author plans to consider in future work - detect the keywords from a given document - the problem is going to be much harder as neither is the problem word nor the choices for its synonym well-defined. In this case, one cannot get away by just issuing a few Web requests, making the overhead of issuing queries significantly higher. I also feel that the author's comparison of the running time of SVD with the time taken to issue a few Web requests is completely unfair. The step of running SVD has to be taken just once in the execution of LSA and should rather be compared with the time taken by the search engine to build up its index. It's just that we don't have to bear the overhead of building the index in the algorithm considered. In LSA too, given a corpus of documents, SVD has to be executed just once and then synonym detection can be done just like how the proposed algorithm does. Though the proposed algorithm is a novel approach, I believe its improvement over LSA is stretched too far. If the author does pursue his intended future work of keyword detection in a document, that will certainly be a really cool application to have. It would also be interesting to see if this technique can be applied in solving crosswords. That problem setting would be unlike both the setting considered in this paper (problem word given with choices for synonyms) as well as that in keyword detection (pairs of words which are synonyms of each other need to be detected). In the case of the crossword, the problem word is known but the choices for its synonym cannot be picked from anywhere.