From: Ravi Kiran (kiran@cs.washington.edu)
Date: Wed Dec 08 2004 - 09:30:40 PST
Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL
Peter D. Turney
Summary:
This paper proposes an algorithm for recognizing synonyms, based
on analyzing the results of search-engine queries using Pointwise Mutual
Information (PMI) and Information Retrieval (IR).
Two important ideas presented in the paper:
This paper ties in well with the speculation presented in
"Proverb: The Proverbial Cruciverbalist" that advances in technology
can be leveraged to solve problems previously deemed intractable. In this
case, the problem of finding synonyms, which involves extracting matches
from a database, was simplified by using a database that is heavily
optimized for search: the AltaVista web search engine. Using the NEAR
operator provided by the search engine to implement the notion of
similarity was a particularly appealing idea, as was using the logarithm
of the co-occurrence probabilities as an information measure for scoring
the choices for a given query.
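
A minimal Python sketch of the scoring idea described above, under stated
assumptions. Hit counts are passed in as plain dictionaries instead of
being fetched from AltaVista, the names pmi_score and best_choice are
hypothetical, and the counts are invented for illustration. Because the
probability of the problem word is the same for every choice, it is
dropped, which leaves log2(hits(problem NEAR choice) / hits(choice)).

```python
import math


def pmi_score(near_hits, choice_hits):
    """Return log2(near_hits / choice_hits), or -inf if either count is zero.

    near_hits:   hits for the query "problem NEAR choice" (assumed input)
    choice_hits: hits for the query "choice" alone (assumed input)
    """
    if near_hits <= 0 or choice_hits <= 0:
        return float("-inf")
    return math.log2(near_hits / choice_hits)


def best_choice(near_counts, choice_counts):
    """Score every candidate and return (highest-scoring choice, all scores)."""
    scores = {
        choice: pmi_score(near_counts.get(choice, 0), hits)
        for choice, hits in choice_counts.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical hit counts for the problem word "levied"; not real data.
choice_counts = {
    "imposed": 40000,
    "believed": 120000,
    "requested": 90000,
    "correlated": 15000,
}
near_counts = {
    "imposed": 400,
    "believed": 120,
    "requested": 90,
    "correlated": 3,
}

winner, scores = best_choice(near_counts, choice_counts)
print(winner)
for choice, score in scores.items():
    print(choice, round(score, 3))
```

The logarithm does not change which choice wins, since it is monotonic; it
only puts the scores on an information (bits) scale.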
As the paper notes, employing the synonym-finder for keyword-based
extraction, particularly for scientific literature, has the potential to
improve the accuracy of such query systems.
Two flaws in the paper:
One thing I found surprising: if the system uses the Web as a
database, why could it not use an online dictionary/thesaurus, such as
the one found at http://m-w.com (the Merriam-Webster website)? The
results (particularly the accuracy obtained) are intriguing because,
while studying for the GRE and TOEFL as an undergrad (2002), I used a
simple Perl script to extract synonyms for a given word and got a very
high percentage of words (around 90%) correct. Using a thesaurus (from
the same website) decreased the search time and increased accuracy in my
case.
The results are presented on a small database of TOEFL/ESL
questions. Given that TOEFL/ESL have been around for some time, the
small size is surprising. Also, the results are not characterized by
question difficulty, which is inherent in examinations such as
TOEFL/ESL. There is no failure analysis at all.
Future directions for research:
The algorithm should be expanded to incorporate the importance
scores of the query results. It is well known that search engines rank
results by relevance and importance. This could be introduced, possibly
as a probabilistic measure. It would also help in finding out how
effective the querying mechanism itself is.
The PMI-IR algorithm assumes that the words surrounding a query,
particularly in the case of a sentence, are statistically independent
when it performs the context scoring. However, context arises more often
from a sequence of words (a phrase) than from a single word (viewed in
this light, the example of tap and maple being contextually related
seems quite contrived). Therefore, bigram and other n-gram analyses would
be of immense help in improving the accuracy of the results.
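
One possible reading of the n-gram suggestion, sketched in Python. This
is an illustration only and not part of PMI-IR: the names ngrams and
phrase_queries are hypothetical, tokenization is a plain whitespace
split, and the quoted-phrase query syntax is an assumption about how a
search engine might accept phrases.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences from tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def phrase_queries(context, n=2):
    """Turn a context sentence into quoted n-gram phrases for querying."""
    tokens = context.lower().split()
    return ['"' + " ".join(gram) + '"' for gram in ngrams(tokens, n)]


# Bigram phrases from a short, made-up context.
for query in phrase_queries("tap the maple tree"):
    print(query)
```

Each phrase could then stand in for a single context word when counting
co-occurrences, so that the score reflects word sequences and not
isolated words.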