Paper Review!

From: Annamalai Muthu (muthu@cs.washington.edu)
Date: Wed Dec 08 2004 - 11:16:34 PST

  • Next message: Stephen Friedman: "Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL"

    Paper Title: Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL

    Authors: Peter D. Turney

    Summary: The paper presents PMI-IR, an unsupervised learning algorithm
    that recognizes synonyms using the results of queries to a web search
    engine.

    The important ideas:

    -The paper presents a very good idea, PMI-IR (Pointwise Mutual
    Information - Information Retrieval), which uses the results of queries
    to a web search engine to pick the synonym of a given word from a set of
    candidates. The method computes a score for each candidate by querying a
    web search engine (AltaVista in this case) and analyzing the number of
    matching documents returned. Four scoring methods are provided, with the
    fourth being the most complex: it estimates the probability that the two
    words occur close together, with neither word negated and both used in a
    particular context.
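    The simplest of the four heuristics (score1 in the paper) can be
    sketched as follows. The hit counts below are fabricated for
    illustration; a real system would obtain them from a search engine, and
    the candidate set is the "levied" example from the paper.

    ```python
    # Sketch of PMI-IR's simplest heuristic (score1):
    #   score(choice) = hits(problem AND choice) / hits(choice)
    # i.e. the conditional probability of seeing the problem word
    # given the choice word, up to a constant factor.

    def pmi_ir_score(joint_hits, choice_hits):
        """score1: hits(problem AND choice) / hits(choice)."""
        if choice_hits == 0:
            return 0.0
        return joint_hits / choice_hits

    def best_synonym(problem, candidates, hits):
        """Return the candidate with the highest PMI-IR score.

        `hits` maps a query string to a (fabricated) hit count.
        """
        return max(
            candidates,
            key=lambda c: pmi_ir_score(hits[f"{problem} AND {c}"], hits[c]),
        )

    # Fabricated hit counts: which candidate is a synonym of "levied"?
    hits = {
        "imposed": 1_000_000, "believed": 2_000_000,
        "requested": 1_500_000, "correlated": 100_000,
        "levied AND imposed": 5_000,
        "levied AND believed": 1_000,
        "levied AND requested": 1_200,
        "levied AND correlated": 40,
    }
    print(best_synonym("levied",
                       ["imposed", "believed", "requested", "correlated"],
                       hits))  # prints "imposed"
    ```

    The later heuristics refine this same ratio: score2 replaces AND with
    the NEAR operator, score3 adds NOT clauses to exclude negated
    occurrences, and score4 further restricts both words to a shared
    context.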

    -The paper compares PMI-IR against LSA, another approach to the same
    synonym-identification problem. PMI-IR outperforms LSA, but LSA, though
    time-consuming, works well on a small data set, whereas PMI-IR requires
    a large amount of data. This illustrates a very nice tradeoff in AI: the
    less data available, the more intelligent and complex the algorithm has
    to be in order to work well. This is similar to the crossword paper,
    where a modestly intelligent approach does well given a large amount of
    data.

    Flaws:

    -I felt that some of the results could have been interpreted better. For
    example, the discussion claims that score2 rates antonyms highly while
    score3 avoids this, but this is not well reflected in the results in
    Tables 3 and 4, where the difference is not significant. That could be
    due to either the small size of the data set or the nature of the chosen
    data set.

    -How dependent is this method on the underlying web search engine? A
    section on this would have been a nice addition. The paper presents only
    the basic idea without discussing the surrounding issues. The data set
    was drawn from tests used to establish a user's minimal command of
    English (TOEFL); a more challenging test would have better supported
    their conclusions.

    Open research questions:

    -This paper again raises the open research question of whether programs
    can understand the semantics of the English language. That would allow a
    program to answer not only synonym questions but also a wider variety of
    questions, such as reading-comprehension questions.

    -Word replacement in queries is one of the applications mentioned. Is it
    possible to understand what a user is querying for and rephrase the
    query completely to obtain optimal results? That would require a much
    more sophisticated solution than simple word replacement.

