From: Anna Cavender (cavender@cs.washington.edu)
Date: Wed Dec 08 2004 - 11:01:09 PST
Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL
Peter D. Turney
One line summary:
PMI-IR, a simple unsupervised learning algorithm for recognizing
synonyms, uses Pointwise Mutual Information (PMI) to evaluate the
similarity of words and Information Retrieval (IR) from the web, via the
Alta Vista search engine, to gather the statistics. When compared with
LSA on TOEFL synonym questions, PMI-IR scores about 10 percentage points
higher than LSA.
The two most important ideas in the paper:
The author observes that PMI can be used to measure the similarity of
words from how often they occur together or near each other on web
pages, which are conveniently indexed by Web search engines.
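
To make the PMI idea concrete, here is a minimal sketch of the simplest
scoring scheme as I understand it: each candidate is scored by the ratio
of hits for the problem word together with the candidate to hits for the
candidate alone. This ranks candidates the same way PMI would, because
the problem word's own frequency is the same for every choice. The hit
counts and the names `pmi_score` and `best_choice` are made up for
illustration; a real run would get the counts from search engine queries.

```python
from typing import Dict, Tuple


def pmi_score(joint_hits: int, choice_hits: int) -> float:
    """Approximate p(problem, choice) / p(choice) from raw hit counts."""
    if choice_hits == 0:
        return 0.0  # no evidence for this choice at all
    return joint_hits / choice_hits


def best_choice(counts: Dict[str, Tuple[int, int]]) -> str:
    """counts maps each choice to (hits for problem AND choice, hits for choice)."""
    return max(counts, key=lambda c: pmi_score(*counts[c]))


# Hypothetical hit counts for the problem word "levied" (not real data).
counts = {
    "imposed": (2000, 100000),
    "believed": (3000, 900000),
    "requested": (1500, 500000),
    "correlated": (50, 40000),
}
print(best_choice(counts))
```

Note that "believed" has the most joint hits in this toy data but still
loses, because dividing by the candidate's own hit count penalizes words
that are simply common.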
Several clever additions to the algorithm mostly eliminate words that
may co-occur but that are not synonyms (such as antonyms). Also, if the
question supplies context (as in ESL exams), that context can be added
to the query so that the results favor the synonym that fits the given
context.
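
A rough sketch of how these refinements might look as search queries is
below. As far as I recall, the paper requires the two words to appear
NEAR each other, filters out pages where either word appears near "not"
(to reduce antonym matches), and optionally ANDs in a context word. The
exact operator syntax and the helper name `build_queries` are my own
approximation, not copied from the paper, so treat this as illustrative.

```python
from typing import Optional, Tuple


def build_queries(problem: str, choice: str,
                  context: Optional[str] = None) -> Tuple[str, str]:
    """Return (numerator_query, denominator_query) for one candidate.

    The score for the candidate would be hits(numerator) / hits(denominator).
    The query syntax is an assumed AltaVista-style boolean syntax.
    """
    ctx = f" AND {context}" if context else ""
    numerator = (f"({problem} NEAR {choice}){ctx} "
                 f'AND NOT (({problem} OR {choice}) NEAR "not")')
    denominator = f'{choice}{ctx} AND NOT ({choice} NEAR "not")'
    return numerator, denominator


num, den = build_queries("big", "large")
print(num)
print(den)

# With a context word taken from the sentence around the problem word.
num_ctx, den_ctx = build_queries("big", "large", context="house")
print(num_ctx)
print(den_ctx)
```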
Due to the relative speed of the Alta Vista search engine, finding
synonyms is quick (16 seconds per question, depending on the network
connection).
The one or two largest flaws in the paper:
The discussion of why PMI-IR performed better than LSA was a bit shallow.
It is unclear to me why this author chose to compare PMI-IR with LSA
instead of with LSI. It seems unfair to compare a learning algorithm
whose data source is the world wide web to one whose data source is a
local encyclopedia. Furthermore, if PMI is “sensitive to the sparse data
problem,” meaning that it only performs well on large data sources, this
comparison is particularly unfair.
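
To illustrate the sparse data concern, here is a toy sketch with
entirely hypothetical counts: the same co-occurrence ratios that separate
two candidates at web scale collapse to a tie when the corpus is shrunk,
because the joint counts round down to zero. The `shrink` helper is just
a crude stand-in for "a much smaller corpus", not anything from the paper.

```python
from typing import Dict, Tuple

Counts = Dict[str, Tuple[int, int]]  # choice -> (joint hits, choice hits)


def score(joint_hits: int, choice_hits: int) -> float:
    """Ratio of joint hits to choice hits; 0.0 when the choice never occurs."""
    return joint_hits / choice_hits if choice_hits else 0.0


def shrink(counts: Counts, factor: int) -> Counts:
    """Simulate a corpus `factor` times smaller by integer-dividing every count."""
    return {c: (j // factor, h // factor) for c, (j, h) in counts.items()}


# Hypothetical web-scale counts (not real data).
web = {"imposed": (2000, 100000), "believed": (3000, 900000)}
small = shrink(web, 10000)

web_scores = {c: score(*v) for c, v in web.items()}
small_scores = {c: score(*v) for c, v in small.items()}
print(web_scores)    # the two choices get clearly different scores
print(small_scores)  # both joint counts round to zero, so the scores tie
```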
The author mentions that other leading IR techniques have not shown an
advantage over LSI, so it may be more interesting to test PMI-IR against
these other leading IR techniques.
Two important open research questions on the topic and why they matter:
It would be interesting to use PMI-IR for search using query expansion
and compare it to TREC systems.
The author hopes to use PMI-IR for automatic keyword extraction. This
seems a bit dangerous because, as of this study, it has only shown 74%
accuracy on multiple-choice problems, which are a much easier domain
than keyword extraction.