From: Indriyati Atmosukarto (indria@cs.washington.edu)
Date: Wed Dec 08 2004 - 08:44:48 PST
Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL
Peter D. Turney
This paper describes an unsupervised learning algorithm
that combines Pointwise Mutual Information (PMI)
with Information Retrieval (IR) to measure the similarity
of pairs of words. The paper then compares the algorithm's
performance against LSA and shows that PMI-IR
performed better than LSA on the TOEFL dataset.
It was interesting to see how the author combined the use of
PMI with IR. The algorithm is based on the co-occurrence of the
problem word with the choice words. The PMI score for each choice
word is calculated using the probabilities of the problem word and
the choice word. These probabilities are estimated from the
hit counts returned by the IR queries. The author also came up with
four ways of calculating the score for each choice word, ranging
from a simple interpretation of word co-occurrence to more refined
ones that use the NEAR operator of the search engine and take the
context of the words into consideration as well.
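
To make the simplest of these scoring schemes concrete, here is a minimal
sketch in Python, as I understand it from the paper: each choice is scored
by hits(problem AND choice) / hits(choice). The probability of the problem
word is the same for every choice and the logarithm is monotonic, so both
can be dropped when only the ranking matters. The names pmi_ir_score,
best_choice, toy_hits and TOY_HITS are my own, the hit counts are invented
for illustration, and no real search engine is queried.

```python
from typing import Callable, Dict, List

# Invented hit counts standing in for search-engine results (assumption:
# these numbers are made up and do not come from the paper or AltaVista).
TOY_HITS: Dict[str, int] = {
    "imposed": 1000,
    "believed": 5000,
    "requested": 4000,
    "correlated": 500,
    "levied AND imposed": 50,
    "levied AND believed": 10,
    "levied AND requested": 8,
    "levied AND correlated": 1,
}


def toy_hits(query: str) -> int:
    """Hypothetical stand-in for a search engine's hit count."""
    return TOY_HITS.get(query, 0)


def pmi_ir_score(problem: str, choice: str, hits: Callable[[str], int]) -> float:
    """Simplest co-occurrence score: hits(problem AND choice) / hits(choice)."""
    choice_hits = hits(choice)
    if choice_hits == 0:
        return 0.0  # avoid division by zero for unseen choice words
    return hits(f"{problem} AND {choice}") / choice_hits


def best_choice(problem: str, choices: List[str], hits: Callable[[str], int]) -> str:
    """Pick the choice word with the highest score."""
    return max(choices, key=lambda c: pmi_ir_score(problem, c, hits))


if __name__ == "__main__":
    choices = ["imposed", "believed", "requested", "correlated"]
    for c in choices:
        print(c, pmi_ir_score("levied", c, toy_hits))
    print("answer:", best_choice("levied", choices, toy_hits))
```

The NEAR-based variant would only change the query string passed to the
hit-count function (for example "levied NEAR imposed"), which is why the
hit-count function is passed in as a parameter here.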
Although the idea is interesting, the limited experimental
results fail to convince me of the algorithm's true performance.
Only 80 TOEFL questions and 50 ESL test questions were used in the
experiments, which is a very small dataset
considering that TOEFL has been around for years. In addition,
the paper presents the results of LSA on the TOEFL dataset but
not on the ESL dataset. A more thorough comparison of the
different scoring techniques would have been appreciated as well.
The first question that occurred to me when I read the introduction
was why the author used the AltaVista search engine to retrieve the information
for the document collection. I would be interested to see whether there would be
any difference in the algorithm's performance if the author used Google or
MSN Search to retrieve the information, especially since Google has a number
of operators that can be used to constrain the search queries. It would also be
interesting to see how the algorithm fares on other types of tests such as the
GRE verbal test, which is known to be more difficult than TOEFL. The verbal test
contains not only synonym questions but also antonyms, analogies, sentence
completion and, most difficult of all, a reading comprehension section.