From: Stephen Friedman (sfriedma@cs.washington.edu)
Date: Wed Dec 08 2004 - 11:37:18 PST
Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL by Peter D. Turney
This paper compared the use of PMI-IR to the use of LSA on the task of
synonym recognition using TOEFL synonym questions.
The first main idea of this paper is that a simple algorithm for
unsupervised learning of synonyms can be built by attaching a Pointwise
Mutual Information (PMI) measure to an Information Retrieval (IR) back
end, i.e., a web search engine. When applied to synonym recognition,
this algorithm (with the proper notion of what constitutes PMI) was able
to perform better than the average non-native-English-speaking college
applicant. The second main idea is that PMI-IR performed better than the
LSA algorithm.
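As a rough illustration of the first idea, here is a minimal sketch of
PMI-style scoring driven by hit counts. It assumes the simplest
co-occurrence form of the score, hits(problem AND choice) / hits(choice),
and replaces the web search engine with a tiny in-memory document list.
The names TOY_DOCS, toy_hits, pmi_ir_score, and best_synonym are
hypothetical and are not taken from the paper.

```python
from typing import Callable, Sequence

# Hypothetical toy "index": each string stands in for one web page.
TOY_DOCS = [
    "the levied tax was imposed on imports",
    "a tax was imposed by the council",
    "they believed the story",
    "the council requested a report",
    "imposed rules and levied fines",
    "he correlated the two series",
]


def toy_hits(*terms: str) -> int:
    """Count documents containing every term (a crude AND query)."""
    return sum(all(t in doc.split() for t in terms) for doc in TOY_DOCS)


def pmi_ir_score(problem: str, choice: str,
                 hits: Callable[..., int] = toy_hits) -> float:
    """Score a choice by hits(problem AND choice) / hits(choice).

    The hit count of the problem word alone is the same for every
    choice, so it can be dropped when only the ranking matters.
    """
    alone = hits(choice)
    if alone == 0:
        return 0.0  # choice never occurs: no evidence either way
    return hits(problem, choice) / alone


def best_synonym(problem: str, choices: Sequence[str],
                 hits: Callable[..., int] = toy_hits) -> str:
    """Pick the choice with the highest score for the problem word."""
    return max(choices, key=lambda c: pmi_ir_score(problem, c, hits))


if __name__ == "__main__":
    options = ["imposed", "believed", "requested", "correlated"]
    print(best_synonym("levied", options))
```

Swapping toy_hits for a function that queries a real search engine (or
a fixed local corpus) leaves the scoring code unchanged, which is what
makes the corpus such an easy variable to confound with the algorithm.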
To me, the biggest flaw is that the comparison is apples to oranges:
the LSA results were obtained from an encyclopedia corpus, while PMI-IR
used the web.
Given that there was about a factor of 20 difference in the size of the
corpus used, but nothing like a factor of 20 difference in performance,
it is misleading to say that PMI-IR is better than LSA as an algorithm;
in fact, it may be worse. The real claim should be that PMI-IR using a
search engine corpus can perform better than LSA using an encyclopedic
corpus. There is no way to know whether the difference in performance
was due to the differing algorithms, the differing corpora, or both.
Essentially, it would have been just as valid for the paper title to
have been “Mining the Web for Synonyms: Web versus Encyclopedia on TOEFL.”
Clearly, the question suggested by the title of this paper is the most
obvious open research question: which algorithm is actually better? One
could more accurately judge their relative strengths and weaknesses by
running both algorithms on the same corpus with the same chunk size, or
with the same compute time, and comparing accuracy.
Another open research question is the scalability of the two algorithms.
It was suggested that the PMI-IR approach would not scale down well
because it does not work well on sparse data. Conversely, LSA may not
gain as much benefit as it scales to larger corpora, so it may perform
better on smaller problems but be outpaced by PMI-IR on larger ones. I
think that this would be a far more useful comparison, as it would tell
you which algorithm is better in a given domain, rather than in one
specific instance.