Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL

From: Stephen Friedman (sfriedma@cs.washington.edu)
Date: Wed Dec 08 2004 - 11:37:18 PST


    Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL by Peter D. Turney

    This paper compared the use of PMI-IR to the use of LSA on the task of
    synonym recognition using TOEFL synonym questions.

    The first main idea of this paper is that a simple unsupervised
    algorithm for recognizing synonyms can be built by attaching a Pointwise
    Mutual Information (PMI) measure to an Information Retrieval back end
    (i.e. a web search engine). When applied to synonym recognition, this
    algorithm (with the proper notion of what constitutes PMI) was able to
    perform better than the average non-English-speaking college applicant.
    The second main idea was that PMI-IR performed better than the LSA
    algorithm.
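
    To make the first idea concrete, here is a minimal sketch of the
    simplest form of the scoring, assuming a plain AND query. For a fixed
    problem word, p(problem) is the same for every choice and log is
    monotonic, so ranking choices by PMI reduces to ranking them by
    hits(problem AND choice) / hits(choice). The names toy_hits,
    pmi_ir_choose and TOY_DOCS are hypothetical, and the tiny in-memory
    document list only stands in for a search engine; it is not the
    paper's actual setup or query syntax.

```python
from typing import Callable, Sequence

# Toy stand-in for a search engine index: each string is one "document".
# (Assumption: a real run would query a web search engine for hit counts.)
TOY_DOCS = [
    "the levied tax was imposed on imports",
    "a tax was imposed by the council",
    "they believed the rumor",
    "the fee was levied and imposed last year",
    "she requested a refund",
    "the council believed the tax was fair",
]


def toy_hits(*terms: str) -> int:
    """Count documents that contain every term (a crude AND query)."""
    return sum(all(t in doc.split() for t in terms) for doc in TOY_DOCS)


def pmi_ir_choose(
    problem: str,
    choices: Sequence[str],
    hits: Callable[..., int] = toy_hits,
) -> str:
    """Pick the choice with the highest co-occurrence ratio.

    score(choice) = hits(problem AND choice) / hits(choice)
    The constant p(problem) and the log are dropped because they do
    not change the ranking of the choices.
    """
    def score(choice: str) -> float:
        denom = hits(choice)
        # A choice that never occurs gets score 0 rather than dividing by 0.
        return hits(problem, choice) / denom if denom else 0.0

    return max(choices, key=score)


if __name__ == "__main__":
    print(pmi_ir_choose("levied", ["imposed", "believed", "requested"]))
```

    On a corpus this small most joint counts are zero, which illustrates
    the sparse-data concern: the ratio is only informative when the
    problem word and the choices co-occur often enough to be counted.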

    To me, the biggest flaw was that the paper tries to compare apples to
    oranges. It compares PMI-IR to LSA, but LSA was using an
    encyclopedia while PMI-IR was using the web.
    Given that there was about a factor of 20 difference in the corpus of
    knowledge used, but not a factor of 20 difference in the performance, it
    is simply misleading to say that PMI-IR is better than LSA in any way.
    In fact, it may be worse. The real claim should be that PMI-IR using a
    search engine corpus can perform better than LSA using an encyclopedic
    corpus. There is no way to know whether the difference in performance
    was due to differing algorithms, differing corpora, or both.
    Essentially, it would have been just as valid for the paper title to
    have been “Mining the Web for Synonyms: Web versus Encyclopedia on TOEFL.”

    Clearly, the question suggested by the title of this paper is the most
    obvious research question still open. One could more accurately judge
    the relative strengths and weaknesses of the two algorithms by giving
    both the same corpus and chunk size, or the same compute time, and then
    comparing accuracy.
    Another open research question is the scalability of the two algorithms.
    It was suggested that the PMI-IR approach would not scale down well
    because it didn’t work well on sparse data. Also, it may be that LSA
    does not gain as much benefit as it scales to larger corpora, so it may
    perform better on smaller problems but be outpaced by PMI-IR on larger
    ones. I think that this would be a far more useful comparison, as it
    would tell you which algorithm is better in a given domain, rather than
    only in specific instances.


