Review #3

From: Brian Ferris (bdferris@cs.washington.edu)
Date: Wed Dec 08 2004 - 11:48:30 PST

    Turney, Peter D. "Mining the Web for Synonyms: PMI-IR versus LSA on
    TOEFL."

    The paper presents an unsupervised learning algorithm for recognizing
    synonyms, built using Pointwise Mutual Information (PMI) and Information
    Retrieval (IR) and trained on statistical data pulled from web-search
    query results.

    The most important idea in this paper is the use of an unsupervised
    web-search for informing the PMI-IR algorithm. The author notes in the
    paper that previous work by Landauer and Dumais suggested that PMI
    performed poorly in a similar experimental comparison because PMI was
    trained on a much smaller data set and PMI is susceptible to sparse
    data. By coupling PMI-IR with a web-search, the algorithm is exposed
    to a very dense data set of English language documents, allowing it to
    perform dramatically better.

    Another important idea in this paper is the use of context in
    conditioning the search results. While simple co-occurrence using the
    'AND' operator in search queries yielded a reasonable number of correct
    answers (62.5%), the addition of the 'NEAR' operator boosted correct
    answers to ~73% on the TOEFL experiments. These results capture the
    insight that two words share more mutual information if they appear
    closer to each other. The addition of a context word in the ESL
    experiments yielded a similar increase in performance.
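
    As a rough illustration of the 'AND' versus 'NEAR' point, here is a
    minimal Python sketch. It assumes a score of the form
    hits(problem OP choice) / hits(choice), which is one simple way to
    turn hit counts into a PMI-style ranking; the exact formula is not
    restated in this review, so treat it as an assumption. The
    toy_hits function and every count in TOY_HITS are invented stand-ins
    for a real search engine, and the words are made up as well.

```python
from typing import Callable, Dict, List

# Invented hit counts standing in for a search engine index. None of these
# numbers come from the paper; they only make the sketch runnable.
TOY_HITS: Dict[str, int] = {
    "big": 1000, "house": 1000, "quick": 1000,
    "large AND big": 300, "large AND house": 400, "large AND quick": 50,
    "large NEAR big": 120, "large NEAR house": 30, "large NEAR quick": 5,
}


def toy_hits(query: str) -> int:
    """Hypothetical stand-in for a search engine's hit count."""
    return TOY_HITS.get(query, 0)


def cooccurrence_score(problem: str, choice: str,
                       hits: Callable[[str], int], operator: str) -> float:
    """Ratio of joint hits to the choice's own hits (assumed score form)."""
    joint = hits(f"{problem} {operator} {choice}")
    alone = hits(choice)
    return joint / alone if alone else 0.0


def best_choice(problem: str, choices: List[str],
                hits: Callable[[str], int], operator: str) -> str:
    """Pick the candidate with the highest co-occurrence score."""
    return max(choices,
               key=lambda c: cooccurrence_score(problem, c, hits, operator))


choices = ["big", "house", "quick"]
print(best_choice("large", choices, toy_hits, "AND"))   # house
print(best_choice("large", choices, toy_hits, "NEAR"))  # big
```

    With these made-up counts, document-level co-occurrence ('AND')
    favors a word that merely shares a topic, while the proximity
    constraint ('NEAR') favors the synonym. Real counts may of course
    behave differently.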

    There were some flaws with the paper. I would like to see some
    clarification on the 'corrected for guessing' detail mentioned at the
    end of section four in the discussion of the Landauer and Dumais paper.
    When discussing their results, the author mentions that most of their
    test scores went down after correcting for guessing. It is not clear
    whether a similar modification was applied to the author's results, or
    why he mentioned it at all. A larger issue is with the discussion of
    PMI-IR versus LSA. While the results involving just PMI-IR are
    compelling, the comparison between the two algorithms would have been
    stronger if the author had followed through on some of his ideas for
    future work, specifically by adjusting either the chunk size or the
    training data so that PMI-IR and LSA started on more even ground for
    comparison.
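
    For reference, the 'corrected for guessing' adjustment questioned
    above is commonly computed by penalizing each wrong answer by
    1/(k-1), where k is the number of answer choices. The sketch below
    assumes that standard formula, four choices per question, and no
    skipped questions; whether Landauer and Dumais used exactly this
    correction is an assumption on my part. It is applied to the two
    PMI-IR figures quoted earlier in this review purely as an
    illustration.

```python
def corrected_for_guessing(fraction_correct: float, n_choices: int = 4) -> float:
    """Standard correction for guessing, assuming every question was answered.

    Each wrong answer costs 1/(n_choices - 1), so a pure random guesser
    ends up with an expected corrected score of zero.
    """
    fraction_wrong = 1.0 - fraction_correct
    return fraction_correct - fraction_wrong / (n_choices - 1)


# Figures quoted in this review ('AND' and 'NEAR' on TOEFL).
for raw in (0.625, 0.73):
    print(f"raw {raw:.1%} -> corrected {corrected_for_guessing(raw):.1%}")
# raw 62.5% -> corrected 50.0%
# raw 73.0% -> corrected 64.0%
```

    Under these assumptions a corrected score is always lower than the
    raw one unless every answer is right, which is why it matters
    whether both systems were reported on the same basis.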

    Immediate areas for future work include the expanded comparison of
    PMI-IR and LSA suggested by the author. I think further work exploring
    different heuristics for mutual information between two terms would
    also be worthwhile. As evidenced by the performance increase from the
    addition of the 'NEAR' operator, more advanced techniques for
    determining co-occurrence should be explored. The author's intuition
    that proximity suggests synonymy is reasonable, but richer heuristics
    may do better.
