Paper Review 3

From: Gaurav Bhaya (gbhaya@cs.washington.edu)
Date: Tue Dec 07 2004 - 20:53:44 PST


    Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL

    Peter D. Turney

    --------------------------------------------------------

     

    Short Summary: This paper introduces an unsupervised learning

    algorithm, based on web search queries, for finding synonyms of a

    given word. It uses a popular search engine, whose index covers

    millions of words, to estimate the probability of co-occurrence of

    the given words.

     

    Important Ideas:

    - One of the most important ideas presented in the paper is using

    the vast index of millions and billions of documents built by

    a search engine. In this way it exploits not only a large base

    of documents but also the computational power of the search engine.

    - The second point is that the paper shows a complex problem can have

    a really simple solution. By doing simple operations, PMI-IR

    is able to obtain better results than LSA on TOEFL questions.

    - I thought the way the paper derived four metrics for related

    words was very interesting. It raises the question of whether there

    exists an "even better" metric, given the performance difference

    between the four presented in the paper.

    - I especially liked the way the paper represents the closeness of

    the two words in the last of the four metrics. Using the idea of context

    seems very interesting.
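
    To make the co-occurrence idea concrete, here is a minimal sketch of

    the simplest kind of score: the hit count for the problem word together

    with a choice, divided by the hit count for the choice alone. The hit

    counts below are invented, and the names hits, score and best_choice

    are hypothetical, not taken from the paper. A real run would send

    these queries to a search engine instead of reading a table.

```python
# Toy hit counts standing in for search-engine results.
# All numbers are invented for illustration only.
TOY_HITS = {
    ("big",): 1000,
    ("large",): 800,
    ("tiny",): 600,
    ("green",): 900,
    ("big", "large"): 120,
    ("big", "tiny"): 30,
    ("big", "green"): 20,
}


def hits(*words):
    """Hypothetical stand-in for the hit count of an AND query."""
    return TOY_HITS.get(tuple(sorted(words)), 0)


def score(problem, choice):
    """Co-occurrence score: hits(problem AND choice) / hits(choice)."""
    denom = hits(choice)
    return hits(problem, choice) / denom if denom else 0.0


def best_choice(problem, choices):
    """Pick the choice that co-occurs most strongly with the problem word."""
    return max(choices, key=lambda c: score(problem, c))


if __name__ == "__main__":
    for c in ["large", "tiny", "green"]:
        print(c, round(score("big", c), 3))
    print("answer:", best_choice("big", ["large", "tiny", "green"]))
```

    The hit count of the problem word is the same for every choice, so

    leaving it out does not change which choice ranks first.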

     

    Flaws:

    - Are TOEFL questions really a good choice for this problem? Most

    TOEFL questions are designed so that humans tend to choose

    one of the wrong choices based on some guessing pattern. This

    guessing pattern may not reflect the co-occurrence of the words in

    any way. Furthermore, it may be possible to construct a set of questions

    that is difficult for the algorithm by using two words that

    occur together in a phrase but are not synonyms.

    - The paper's computation time only counts querying the search engine.

    What about the time the search engine takes to crawl all the web pages

    and build its index? Is the time comparison between the two algorithms

    fair?

    - Isn't the sample size too small to draw a conclusion? 74% accuracy on

    not even 100 items!
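
    To put a rough number on the sample-size worry, the sketch below

    computes a normal-approximation (Wald) 95% interval for an accuracy

    figure. The counts are an assumption on my part: 59 correct out of 80

    questions, which rounds to the 74% above. The function name

    wald_interval is hypothetical.

```python
import math


def wald_interval(correct, total, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width


if __name__ == "__main__":
    # Assumed counts: 59 of 80 correct (about 74%).
    low, high = wald_interval(59, 80)
    print(f"accuracy {59 / 80:.3f}, 95% interval {low:.3f} to {high:.3f}")
```

    Under these assumed counts the interval spans roughly 64% to 83%, so

    a gap of around ten points to another method is not clearly significant.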

     

    Open Questions:

    - Can this technique be used when the alternatives are not given? Can

    a tool be designed that finds the synonyms of a word given

    just the word itself? Unless this is possible, hand-coded dictionaries

    cannot be eliminated.

    - How can this idea aid in solving the crossword puzzle that we saw

    earlier? Would this technique be good enough to replace all the

    expert modules in the previous paper, since it exploits something

    more than synonyms?

    - The paper points out various possibilities for future work towards

    the end. These include extending to LSI and using a smaller index.

