PMI-IR versus LSA on TOEFL

From: Anna Cavender (cavender@cs.washington.edu)
Date: Wed Dec 08 2004 - 11:01:09 PST


    Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL

    Peter D. Turney

    One line summary:

    PMI-IR, a simple unsupervised learning algorithm for recognizing
    synonyms, uses Pointwise Mutual Information (PMI) to evaluate the
    similarity of words, with word statistics gathered by Information
    Retrieval (IR) from the web via the Alta Vista search engine. When
    compared with LSA on TOEFL exams, PMI-IR scores about 10 percentage
    points higher than LSA at recognizing synonyms.

    The two most important ideas in the paper:

    The author has observed that PMI can be used to test the similarity of
    words that occur together or near each other on web pages that are
    conveniently indexed by Web search engines.
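
    As a rough illustration of this idea (a sketch, not the author's
    code): PMI compares how often two words co-occur with how often they
    would co-occur by chance. When ranking the choices for a single
    problem word, the problem word's own frequency is the same for every
    choice, so it is enough to compare hits(problem AND choice) /
    hits(choice). The function names and all hit counts below are made up
    for the example; a real run would obtain the counts from search
    engine queries.

```python
def pmi_ir_score(hits_both: int, hits_choice: int) -> float:
    """Ratio hits(problem AND choice) / hits(choice).

    Returns 0.0 when the choice has no hits, to avoid dividing by zero.
    """
    if hits_choice == 0:
        return 0.0
    return hits_both / hits_choice


def best_choice(cooccurrence_hits: dict, choice_hits: dict) -> str:
    """Pick the choice whose co-occurrence ratio with the problem word is highest."""
    return max(
        choice_hits,
        key=lambda c: pmi_ir_score(cooccurrence_hits.get(c, 0), choice_hits[c]),
    )


# Hypothetical hit counts for the problem word "levied" (illustration only).
choice_hits = {"imposed": 1000, "believed": 5000, "requested": 4000, "correlated": 800}
cooccurrence_hits = {"imposed": 120, "believed": 150, "requested": 100, "correlated": 8}

print(best_choice(cooccurrence_hits, choice_hits))
```

    Note that "believed" co-occurs with the problem word more often in
    absolute terms, but it is also far more common overall, so its ratio
    is lower. That normalization is the point of using PMI rather than
    raw co-occurrence counts.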

    Several clever additions to the algorithm mostly eliminate words that
    may co-occur but that are not synonyms (such as antonyms). Also, if the
    query contains context (such as in ESL exams), that context can be used
    to ensure query results contain the proper synonym for the given context.
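
    One way such refinements might be expressed is as progressively
    stricter search queries. The sketch below only builds query strings.
    The operator spellings (NEAR, AND NOT, OR) and the helper names are
    assumptions made for illustration, not a verified description of the
    search engine's syntax or of the paper's exact queries.

```python
def near_query(problem: str, choice: str) -> str:
    """Require the two words to appear close together, not just on the same page."""
    return f"{problem} NEAR {choice}"


def negation_filtered_query(problem: str, choice: str) -> str:
    """Also exclude pages where either word appears near "not".

    The intuition: antonyms tend to show up in negated phrasing.
    """
    return f'({near_query(problem, choice)}) AND NOT (({problem} OR {choice}) NEAR "not")'


def context_query(problem: str, choice: str, context: str) -> str:
    """Additionally require a context word taken from the exam sentence."""
    return f"({negation_filtered_query(problem, choice)}) AND {context}"


# Hypothetical words, chosen only to show the shape of each query.
print(near_query("big", "large"))
print(negation_filtered_query("big", "large"))
print(context_query("big", "large", "house"))
```

    Each stricter query returns fewer pages, which is the trade-off:
    better filtering of non-synonyms at the cost of sparser counts.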

    Due to the relative speed of the Alta Vista search engine, finding
    synonyms is quick (16 seconds per question, depending on the network
    connection).

    The one or two largest flaws in the paper:

    The discussion of why PMI-IR performed better than LSA was a bit shallow.

    It is unclear to me why this author chose to compare PMI-IR with LSA
    instead of with LSI. It seems unfair to compare a learning algorithm
    whose data source is the world wide web to one whose data source is a
    local encyclopedia. Furthermore, if PMI is “sensitive to the sparse data
    problem,” as in it only performs well on large data sources, this
    comparison is particularly unfair.

    The author mentions that other leading IR techniques have not shown an
    advantage over LSI, so it may be more interesting to test PMI-IR against
    these other leading IR techniques.

    Two important open research questions on the topic and why they matter:

    It would be interesting to use PMI-IR for search using query expansion
    and compare it to TREC systems.

    The author hopes to use PMI-IR for automatic keyword extraction. This
    seems a bit dangerous because, as of this study, it has only shown a 74%
    acceptance rate on multiple-choice problems, which are a much easier
    domain than keyword extraction.


