From: Annamalai Muthu (muthu@cs.washington.edu)
Date: Wed Dec 08 2004 - 11:16:34 PST
Paper Title: Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL
Authors: Peter D. Turney
Summary: The paper presents PMI-IR, an unsupervised learning algorithm
that recognizes synonyms using the results of queries to a web search
engine.
The important ideas:
-The paper presents a very good idea, PMI-IR (Pointwise Mutual
Information - Information Retrieval), which uses the results of queries
to a web search engine to pick the synonym of a given word from a set of
candidates. The method computes a score for each candidate by querying
a web search engine (AltaVista in this case) and analyzing the results
returned. Four scoring heuristics are provided, with the fourth being
the most complex of the lot: it estimates the probability of the two
words occurring close together, with neither word negated and both used
in a particular context.
-The paper also examines LSA, another approach to the same problem of
synonym identification. PMI-IR outperforms LSA, but LSA has the
attribute that, though it is time-consuming, it works well on a small
data set, whereas PMI-IR requires a large amount of data. This
illustrates a very nice tradeoff in AI: the less data available, the
more intelligent and complex the algorithm has to be in order to work
well. This is similar to the crossword paper, where, given a large
amount of data, a modestly intelligent approach does well.
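The scoring idea described above can be sketched in a few lines. Below is a minimal illustration of the simplest of the four heuristics (score1), which ranks each candidate by how often it co-occurs with the problem word relative to its overall frequency. The hit counts here are made-up stand-ins for real search-engine counts, and the function names are my own, not from the paper:

```python
def pmi_score1(joint_hits, choice_hits):
    """score1(choice) is proportional to p(problem & choice) / p(choice),
    approximated by the ratio of co-occurrence hits to solo hits."""
    if choice_hits == 0:
        return 0.0
    return joint_hits / choice_hits

def best_synonym(choices, joint, solo):
    """Return the candidate with the highest PMI-IR score1."""
    return max(choices, key=lambda c: pmi_score1(joint[c], solo[c]))

# Hypothetical hit counts for a TOEFL-style question with
# problem word "levied" and four candidate answers.
solo = {"imposed": 1000, "believed": 5000,
        "requested": 3000, "correlated": 800}
joint = {"imposed": 120, "believed": 30,
         "requested": 40, "correlated": 5}

print(best_synonym(list(solo), joint, solo))  # "imposed" scores highest
```

Note that the raw co-occurrence count alone would not be enough: a very common word co-occurs with everything, which is why each count is normalized by the candidate's own frequency.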
Flaws:
-I felt that some of the results could have been interpreted better.
For example, the discussion mentions that score2 rates antonyms highly
while score3 avoids this, but this is not well reflected in the results
presented in Tables 3 and 4: the difference between the two scores is
not significant. That could be due to either the small size of the data
set or the nature of the chosen data set.
-How dependent is this method on the underlying web search engine? A
section on this would have been a nice addition; the paper presents the
basic idea without discussing such surrounding issues. Also, the data
set was drawn from tests used to establish a user's minimal command of
English (TOEFL); a more demanding test would have better justified the
conclusions.
Open research questions:
-This paper again raises the open research question of whether programs
can understand the semantics of the English language and answer such
questions. That would allow a program not only to answer synonym
questions, but also a wider variety of questions, such as reading
comprehension.
-Word replacement in queries is one of the applications mentioned. Is
it possible to understand what a user is querying for and rephrase the
query completely to obtain optimal results? That would require a much
more sophisticated solution than simple word replacement.