From: Jon Froehlich (jfroehli@cs.washington.edu)
Date: Wed Dec 08 2004 - 08:53:19 PST
1. Paper title/Author
Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL by Peter D. Turney.
2. One-line Summary
The paper introduces a simple unsupervised learning algorithm, called PMI-IR,
that combines an information-theoretic measure (PMI, Pointwise Mutual
Information) with an information retrieval technique (IR) to recognize
synonyms, and is shown to be more successful than LSA, a more complicated
algorithm.
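To make the core idea concrete, here is a minimal Python sketch of the paper's simplest score (score1): pick the candidate whose co-occurrence count with the problem word, normalized by the candidate's own frequency, is highest. The `hits` function below is a hypothetical stand-in for a search engine's document-count interface (AltaVista, in the paper), and the toy corpus is my own invention.

```python
# Minimal sketch of PMI-IR's score1. The `hits` function is a hypothetical
# stand-in for a search engine's hit-count interface; here it just counts
# matching documents in a tiny invented corpus.

def hits(query):
    """Hypothetical: number of documents containing every term in `query`."""
    corpus = [
        "the levied tax was imposed on imports",
        "a tax was levied on all goods",
        "he believed the story",
        "the requested documents arrived",
    ]
    return sum(all(term in doc.split() for term in query) for doc in corpus)

def score1(problem, choice):
    # Proportional to p(problem & choice) / p(choice); the shared factor
    # p(problem) cancels when comparing choices, so it is omitted.
    denom = hits([choice])
    return hits([problem, choice]) / denom if denom else 0.0

def best_synonym(problem, choices):
    return max(choices, key=lambda c: score1(problem, c))

print(best_synonym("levied", ["imposed", "believed", "requested"]))
```

Taking the argmax over the candidates mirrors the multiple-choice setup of the TOEFL questions used in the paper's evaluation.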
3. Two Most Important Ideas and Why
(i) Shows how the information potential of the web can be leveraged to build a
simple, unsupervised learning algorithm. A system written in Perl using
PMI-IR was shown to be very successful at recognizing synonyms in
standardized tests typically given to non-native English speakers to test
language understanding and mastery. It was also purported to perform better
than a previously existing algorithm, LSA.
(ii) Highlights the importance of synonyms in increasing information
retrieval precision. Though I'm not entirely sure this is a novel
contribution given that, as the author points out, two systems already exist
which utilize this notion in one way or another: LSA and query expansion.
4. Flaws in Paper
(i) The author does not sufficiently address corpus size difference in tests
used to compare LSA to PMI-IR. Though both the TOEFL and ESL experiments
used to evaluate PMI-IR appear sound (the very fact that Dumais used a
similar setup to validate LSA lends credibility to this method), the
comparison to the performance of LSA is rather difficult (as the author
fully admits). The author states, "The results with the TOEFL questions show
that PMI-IR can score almost 10% higher than LSA. The results with the ESL
questions support the view that this performance is not a chance
occurrence." This is somewhat misleading, as the author then follows with,
"the interpretation of the results is difficult" because "... PMI-IR is using
a much larger data source than LSA." It seems reasonable that, given this
discussion, the author would have then introduced a more balanced experiment
which either scales LSA up to using a corpus based on the AltaVista database
or scales PMI-IR down to using a corpus based on an encyclopedia. The author
does mention this disparity but does so in the context of future work: "For
future work, it would be interesting to see how LSA performs with such a
large collection of text." He even goes on to say that, "perhaps the
strength of LSA is that it can achieve relatively good performance with
relatively little text" and that "it seems likely that PMI-IR achieves high
performance by 'brute force', through sheer size of the corpus of text that
is indexed by AltaVista." Perhaps I'm being overly nitpicky here, but I
believe the points raised above should have been more adequately addressed,
especially given the fact that the subtitle of this paper is "PMI-IR versus
LSA on TOEFL." After reading the paper, I am convinced that PMI-IR is a nice
way of utilizing readily available web interfaces to improve synonym
recognition (I certainly agree with the notion that you should use data if
you have access to it), but I am not convinced that PMI-IR is necessarily better
than LSA (particularly under balanced constraints).
(ii) The author stresses the simplicity of PMI-IR throughout the paper,
particularly in comparison to LSA. However, this "simplicity" seems to be a
matter of perspective. Certainly, up front LSA is computationally more
expensive, as it depends on matrix calculations; however, it's not clear which
algorithm, LSA or PMI-IR, requires more resources or computational
infrastructure. For example, PMI-IR relies on the AltaVista web search
engine for its data source and searching algorithm; this presupposes a ready
connection to the internet. LSA seems to cope much better with smaller
document collections and therefore may be better suited to mobile applications,
for example, where data bandwidth is limited. (The author does touch on this
a bit with his example hybrid system which uses a "small, local search
engine for high-frequency words, but resorts to a large, distant search
engine for rare words.") This is perhaps not a flaw, but a clear functional
difference between the two algorithms. It should be noted that I am
encouraged by/fully support applications that exploit the web for data; I
believe data mining will be an integral part of many interesting
applications of the future.
(iii) Overall, I didn't feel like this paper was as comprehensive or as well
written as the last paper we read. In particular, I thought the Experiments,
Discussion of Results and the Applications sections were rather soft. I
covered my qualms re: the experiments and discussion in point (i) above. The
Applications section is an opportunity for the author to "wow" the audience
with the relevance/"coolness" of their work; none of the applications
seemed especially interesting.
5. Two or Three Important Open Research Questions on Topic and Why They
Matter
(i) Questions that came to mind while reading this paper: could PMI-IR be
used to automatically construct thesauruses (with lists of both synonyms and
antonyms)? How would including a comprehensive dictionary and thesaurus (say
Oxford's) into a text corpus affect either PMI-IR or LSA? Would the results
improve? What other applications could benefit from PMI-IR?
(ii) Unlike the prior two papers that we've read for class, this paper has
many suggestions for future work embedded in the discussion of results. I
thought it might be interesting to write these out together to get a better
sense of direction. The author's own delineation of future work falls into
two categories: (1) further evaluation of the LSA/PMI-IR algorithms and (2)
suggested uses for PMI-IR.
To further evaluate LSA/PMI-IR:
1. Regarding the possibility of applying LSA to the same corpus used in
the PMI-IR experiment (i.e., the AltaVista database of 350 million web pages):
"For future work it would be interesting to see how LSA performs with such
a large collection of text."
2. Regarding the testing of the Landauer and Dumais claim that mutual
information analysis would obtain a score of about 37% on the TOEFL
questions, given the same source text and chunk size as they used with LSA:
"Although it appears that they have not tested this conjecture, it seems
plausible to me. It would be interesting to test this hypothesis. Although
it might be a challenge to scale LSA up to this volume of text, PMI can
easily be scaled down to the encyclopedia text that is used by Landauer and
Dumais. This is another possibility for future work."
3. Regarding the suggestion that much of the difference between LSA and
PMI-IR is due to the smaller chunk size of PMI-IR (where a chunk is a
document or article): "It is interesting that the TOEFL performance for
score1 (62.5%) is approximately the same as the performance for LSA (64.4%).
Much of the difference in performance between LSA and PMI-IR comes from
using the NEAR operator instead of the AND operator. This suggests that
perhaps much of the difference between LSA and PMI-IR is due to the smaller
chunk size of PMI-IR (for the scores other than score1). To test this
hypothesis, the LSA experiment with TOEFL could be repeated using the same
source text (an encyclopedia), but a smaller chunk size. This is another
possibility for future work."
4. Regarding the hypothesis that query expansion achieves essentially
the same effect as LSI: "The hypothesis implies that LSI will tend to
perform better than an IR system without query expansion, but there will be
no significant difference between an IR system with LSI and an IR system
with query expansions (assuming all factors are equal)."
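The AND-versus-NEAR contrast quoted in item 3 above can be illustrated with a toy sketch: AND counts documents where both terms appear anywhere (document-sized chunks), while NEAR only counts documents where they appear within a small window, effectively shrinking the chunk size. The window width and whitespace tokenizer are my own illustrative assumptions, not the paper's exact settings.

```python
# Toy illustration of the AND vs NEAR contrast: whole-document
# co-occurrence versus co-occurrence within a small token window.

def and_hits(docs, a, b):
    """Documents containing both terms anywhere (document-sized chunks)."""
    return sum(a in toks and b in toks for toks in (d.split() for d in docs))

def near_hits(docs, a, b, window=10):
    """Documents with both terms within `window` tokens (small chunks)."""
    count = 0
    for d in docs:
        toks = d.split()
        pos_a = [i for i, t in enumerate(toks) if t == a]
        pos_b = [i for i, t in enumerate(toks) if t == b]
        if any(abs(i - j) <= window for i in pos_a for j in pos_b):
            count += 1
    return count

docs = [
    "a tax was levied on imports",                   # terms close together
    "tax " + " ".join(["filler"] * 30) + " levied",  # terms far apart
]
print(and_hits(docs, "tax", "levied"))   # both documents match
print(near_hits(docs, "tax", "levied"))  # only the first matches
```

The gap between the two counts is exactly the kind of effect the author suggests testing by rerunning LSA on the encyclopedia text with a smaller chunk size.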
To build applications using PMI-IR:
1. Use PMI-IR as a tool to aid in the construction of lexical
databases.
2. Use PMI-IR to improve IR systems (i.e., through the use of informed
query expansion techniques).
3. Apply PMI-IR to automatic keyword extraction.
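Application 2 above could be sketched as follows: expand a query with candidate synonyms, but keep only those whose PMI-IR-style score clears a threshold, so the expansion is "informed" rather than indiscriminate. The thesaurus, score function, and threshold here are all my own illustrative assumptions, not anything specified in the paper.

```python
# Hypothetical sketch of informed query expansion: a synonym is added to
# the query only if a PMI-IR-style score rates it highly enough.

def expand_query(query_terms, thesaurus, score, threshold=0.5):
    expanded = list(query_terms)
    for term in query_terms:
        for syn in thesaurus.get(term, []):
            if score(term, syn) >= threshold:
                expanded.append(syn)
    return expanded

# Toy thesaurus and score table, purely for illustration.
toy_thesaurus = {"car": ["automobile", "cart"]}
toy_scores = {("car", "automobile"): 0.9, ("car", "cart"): 0.1}
toy_score = lambda a, b: toy_scores[(a, b)]

print(expand_query(["car"], toy_thesaurus, toy_score))  # ['car', 'automobile']
```

Thresholding on the score is what distinguishes this from naive thesaurus expansion, which would also pull in the weakly related "cart".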