From: Gaurav Bhaya (gbhaya@cs.washington.edu)
Date: Tue Dec 07 2004 - 20:53:44 PST
Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL
Peter D. Turney
--------------------------------------------------------
Short Summary: This paper introduces an unsupervised learning
algorithm, based on web queries, for finding synonyms of a given
word. It queries a popular search engine, whose index covers
millions of words, to estimate the probability that the given
words co-occur.
Important Ideas:
- One of the most important ideas presented in the paper is using
the vast index of millions and billions of documents built by
a search engine. In this way it exploits not only a large base
of documents but also the computational power of the search engine.
- Second, the paper shows that a complex problem can have
a really simple solution. By doing simple operations, PMI-IR
is able to obtain better results than LSA on TOEFL questions.
- I thought the way the paper derived four metrics for related
words was very interesting. It raises the question of whether there
exists an "even better" metric, given the performance difference
between the four presented in the paper.
- I esp. liked the way in which the paper represented the closeness of
two words in the last of the four metrics. Using the idea of context
seems very interesting.
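
To make the co-occurrence idea concrete, here is a minimal sketch of the
simplest kind of score: the number of documents containing both the problem
word and a choice, divided by the number containing the choice. The
probability of the problem word is the same for every choice, so it can be
dropped from the ratio. Everything named here is an assumption for
illustration: DOCS is a toy in-memory corpus standing in for the search
engine index, and hits(), cooccurrence_score() and best_choice() are
hypothetical helpers, not the paper's code and not a real search API.

```python
from typing import Iterable

# Toy stand-in for a search engine index: each string is one "document".
# (Assumption for illustration; a real system would send queries instead.)
DOCS = [
    "the big house had a large garden",
    "a large and big dog barked",
    "the small house was tiny",
    "a big tree and a large rock",
    "the tiny mouse was small",
]


def hits(*words: str) -> int:
    """Count toy documents containing all of the given words (an AND query)."""
    return sum(all(w in doc.split() for w in words) for doc in DOCS)


def cooccurrence_score(problem: str, choice: str) -> float:
    """hits(problem AND choice) / hits(choice); 0.0 if the choice never occurs."""
    denom = hits(choice)
    return hits(problem, choice) / denom if denom else 0.0


def best_choice(problem: str, choices: Iterable[str]) -> str:
    """Pick the alternative with the highest co-occurrence score."""
    return max(choices, key=lambda c: cooccurrence_score(problem, c))


if __name__ == "__main__":
    print(best_choice("big", ["large", "small", "tiny"]))
```

The later metrics mentioned above would change only how hits() is queried
(for example, restricting matches by proximity or context); the argmax over
the given alternatives stays the same.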
Flaws:
- Are TOEFL questions really a good choice for the problem? Most
TOEFL questions are designed so that humans tend to choose
one of the wrong choices based on some guessing pattern. This
guessing pattern may not reflect the co-occurrence of the words in
any way. Furthermore, it may be possible to generate a set of questions
that are difficult for the algorithm by using two words that
occur together in a phrase.
- The paper compares computation time in terms of the time to query
the web server. What about the time taken by the search engine to
crawl all the web pages and build an index? Is the time comparison
between the two algorithms fair?
- Isn't the sample size too small to draw a conclusion? 74% accuracy on
not even 100 items!!
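
The sample-size concern can be put in numbers with a confidence interval
for a proportion. The sketch below uses the Wilson score interval; the
counts are an assumption, since the review says only "74%" on "not even 100
items". 59 correct out of 80 is used here because it is consistent with
about 74%. wilson_interval() is a hypothetical helper written for this note.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    # Assumed counts: 59 of 80 correct (about 74%).
    low, high = wilson_interval(59, 80)
    print(f"95% interval: {low:.3f} to {high:.3f}")
```

Under these assumed counts the interval spans roughly twenty percentage
points, which is the sense in which so few items leave the accuracy figure
loosely pinned down.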
Open Questions:
- Can this technique be used when the alternatives are not given? Can
a tool be designed that finds synonyms of a word given
just the word itself? Unless this is possible, hand-coded dictionaries
cannot be eliminated.
- How can this idea aid in solving the crossword puzzle that we saw
earlier? Would this technique be good enough to replace all the
expert modules in the previous paper, since it exploits something
more than synonyms?
- The paper points out various other possibilities for future work
towards the end. These include extending to LSI and using a smaller index.