From: Pravin Bhat (pravinb@u.washington.edu)
Date: Wed Dec 08 2004 - 05:44:09 PST
Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL
Peter D. Turney
Paper Summary: The paper presents an unsupervised learning algorithm
that solves the multiple choice synonym matching problem by combining
results from multiple web queries.
Paper Strengths:
The author applies a mathematical analysis tool, Pointwise Mutual
Information, to increase the precision of a high-recall
Information Retrieval method, namely web queries. The author is able
to demonstrate that the fairly simple and intuitive PMI approach tends
to perform just as well as LSA, a mathematically stronger but
computationally expensive method, when used on large databases.
The paper is ingenious in that it presents a technique to leverage existing
low-cost technologies as research tools. The two main technologies leveraged
are the internet, which is probably the largest distributed database of its
kind, and sophisticated search engines which employ all the latest techniques
in the Information Retrieval field to remain competitive. By building on
all the research that has already gone into these technologies, the author
was able to keep his side of the implementation to a bare minimum.
Paper Flaws:
I fail to understand why the author stopped at "not" as his antonym filter.
There are several modifiers in the English language which imply an
antonym: "vs" (David vs Goliath), "as opposed to", "instead of", etc.
The author could have produced an exhaustive list of such modifiers
and then reduced the list to the N most frequently used modifiers. The
frequency of each modifier could have been estimated using a web query.
Increasing N would have increased the algorithm's precision at the cost of runtime.
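
A rough sketch of the selection step I have in mind. The modifier list,
the hit counts, and the names top_n_modifiers and fake_hit_count are all
made up for illustration; a real version would replace fake_hit_count with
an actual web query, which is not shown here.

```python
from typing import Callable, Dict, List

def top_n_modifiers(candidates: List[str],
                    hit_count: Callable[[str], int],
                    n: int) -> List[str]:
    """Rank candidate antonym modifiers by estimated frequency, keep the top n."""
    ranked = sorted(candidates, key=hit_count, reverse=True)
    return ranked[:n]

# Hypothetical frequencies standing in for web-query hit counts.
FAKE_HITS: Dict[str, int] = {
    "not": 1000,
    "vs": 400,
    "instead of": 250,
    "as opposed to": 90,
}

def fake_hit_count(phrase: str) -> int:
    # Assumption: a real implementation would query a search engine here.
    return FAKE_HITS.get(phrase, 0)

selected = top_n_modifiers(list(FAKE_HITS), fake_hit_count, n=2)
print(selected)
```

Each extra modifier kept means at least one more filtered query per
candidate word, which is where the runtime cost of a larger N comes from.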
What the algorithm actually calculates is co-occurrence of word pairs, which
might be a reasonable approximation for the synonym relation at
the TOEFL level. However, high co-occurrence is not always due
to synonymy, and often the ambiguity cannot simply be resolved by an
antonym filter. For example, the method is likely to match "lord" as a
synonym of "rings" simply because "lord of the rings" is a popular
sequence of words on the internet. Similarly, the method is also prone
to errors/biases in the search engine. For example, not too long ago Google
was bombed into associating "dismal failure" with "George Bush", which
would have thrown off the algorithm (OK, bad example).
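
To make the "lord"/"rings" concern concrete, here is a minimal sketch. It
assumes a score of the form hits(problem NEAR choice) / hits(choice), which
is my simplified reading of a co-occurrence score and not necessarily the
exact formula in the paper. All hit counts are invented, and the function
names are hypothetical.

```python
from typing import Dict, FrozenSet, List

def cooccurrence_score(problem: str, choice: str,
                       pair_hits: Dict[FrozenSet[str], int],
                       word_hits: Dict[str, int]) -> float:
    """Assumed score: hits(problem NEAR choice) / hits(choice)."""
    denom = word_hits.get(choice, 0)
    if denom == 0:
        return 0.0  # no evidence for this choice at all
    return pair_hits.get(frozenset((problem, choice)), 0) / denom

def best_choice(problem: str, choices: List[str],
                pair_hits: Dict[FrozenSet[str], int],
                word_hits: Dict[str, int]) -> str:
    """Pick the choice with the highest co-occurrence score."""
    return max(choices,
               key=lambda c: cooccurrence_score(problem, c, pair_hits, word_hits))

# Invented counts: the popular phrase inflates the lord/rings pair.
word_hits = {"rings": 2000, "master": 5000, "servant": 1000, "castle": 3000}
pair_hits = {
    frozenset(("lord", "rings")): 900,
    frozenset(("lord", "master")): 500,
    frozenset(("lord", "servant")): 100,
    frozenset(("lord", "castle")): 150,
}

answer = best_choice("lord", ["rings", "master", "servant", "castle"],
                     pair_hits, word_hits)
print(answer)
```

With these made-up numbers the sketch picks "rings" over "master", and an
antonym filter would not help since no negation is involved.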
Future work:
- Comparing LSA and PMI-IR on the same database sizes. This would
help us better understand exactly how much data is required to get away
with using PMI-IR.
- More work could be done to semantically analyze the search results from
the NEAR option. This way the algorithm will be able to use the rules of
English grammar to filter out nonsensical co-occurrences like
".. was small. Big day for..."
- Another way to filter junk from the search results would be to limit the search
to credible sources - online encyclopedias, dictionaries, journals, newspapers, etc.
Google has an option that lets you limit your searches to particular websites or domains.
Similarly, accuracy can be improved by searching in specialized domains for
domain-specific terms. For example, if we knew our query word was a math term
then we could limit our search to Wolfram.com.
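
A small sketch of the site-restriction idea, using Google's "site:" query
operator. The helper name restrict_to_sites and the example query word are
made up for illustration, and combining several "site:" terms with OR is
an assumption about how the engine parses the query.

```python
from typing import List

def restrict_to_sites(query: str, domains: List[str]) -> str:
    """Append site: restrictions to a query string (hypothetical helper)."""
    if not domains:
        return query  # nothing to restrict to
    # Assumption: multiple site: terms can be combined with OR.
    site_clause = " OR ".join(f"site:{d}" for d in domains)
    return f"{query} {site_clause}"

# A math term restricted to a single specialized domain.
math_query = restrict_to_sites("eigenvalue", ["wolfram.com"])
print(math_query)
```

The same helper could take a list of encyclopedia or newspaper domains to
implement the "credible sources" filter.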