Review3-Mining the Web for Synonyms

From: Jon Froehlich (jfroehli@cs.washington.edu)
Date: Wed Dec 08 2004 - 08:53:19 PST

    1. Paper title/Author

    Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL by Peter D. Turney.

     

    2. One-line Summary

    The paper introduces a simple unsupervised learning algorithm, called
    PMI-IR, that combines Pointwise Mutual Information (PMI), an
    information-theoretic measure, with Information Retrieval (IR) to
    recognize synonyms, and it is shown to be more successful than LSA, a
    more complicated algorithm.
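
    As I read it, the core of PMI-IR is just a ratio of search-engine hit
    counts: score the problem word against each candidate by how often the
    two co-occur, normalized by how often the candidate occurs at all, and
    pick the highest-scoring candidate. Below is a minimal Python sketch of
    that idea (my own illustration, not the author's Perl system); the toy
    corpus stands in for the AltaVista index so the example runs
    stand-alone, and the two scoring functions correspond only roughly to
    the score1 (AND) and score2 (NEAR) variants discussed in the paper.

        # Minimal, runnable sketch of the PMI-IR scoring idea as I understand it
        # (my own illustration, not the author's code). A toy in-memory "corpus"
        # stands in for the AltaVista index.

        TOY_CORPUS = [
            "the big dog was very large and hard to miss",
            "a large truck and a big truck passed by",
            "the exam was hard and the questions were difficult",
            "climbing the hill was difficult but not impossible",
        ]

        def hits(*words, window=None, corpus=TOY_CORPUS):
            """Count documents containing all words; if window is given, the words
            must also co-occur within that many tokens (like the NEAR operator)."""
            count = 0
            for doc in corpus:
                tokens = doc.split()
                positions = [[i for i, t in enumerate(tokens) if t == w] for w in words]
                if not all(positions):
                    continue
                if window is None or any(abs(i - j) <= window
                                         for i in positions[0] for j in positions[-1]):
                    count += 1
            return count

        def score1(problem, choice):
            # roughly the paper's score1: AND-style co-occurrence / frequency of choice
            return hits(problem, choice) / max(hits(choice), 1)

        def score2(problem, choice):
            # roughly the paper's score2: NEAR-style co-occurrence / frequency of choice
            return hits(problem, choice, window=10) / max(hits(choice), 1)

        def best_choice(problem, choices, score=score2):
            # pick the candidate synonym with the highest PMI-IR score
            return max(choices, key=lambda c: score(problem, c))

        print(best_choice("big", ["large", "difficult", "impossible"]))  # -> large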

     

    3. Two Most Important Ideas and Why

    (i) Shows how the information potential of the web can be leveraged to
    build a simple, unsupervised learning algorithm. A system written in Perl
    using PMI-IR was shown to be very successful at recognizing synonyms on
    standardized tests typically given to non-native English speakers to test
    language understanding and mastery. It was also purported to perform
    better than a previously existing, more complicated algorithm (LSA).

     

    (ii) Highlights the importance of synonyms in increasing information
    retrieval precision. Though I'm not entirely sure this is a novel
    contribution given that, as the author points out, two systems already
    exist which utilize this notion in one way or another: LSA and query
    expansion.

     

    4. Flaws in Paper

     

    (i) The author does not sufficiently address the corpus size difference in
    the tests used to compare LSA to PMI-IR. Though both the TOEFL and ESL
    experiments used to evaluate PMI-IR appear sound (the very fact that Dumais
    used a similar setup to validate LSA lends credibility to this method), the
    comparison to the performance of LSA is rather difficult (as the author
    fully admits). The author states, "The results with the TOEFL questions show
    that PMI-IR can score almost 10% higher than LSA. The results with the ESL
    questions support the view that this performance is not a chance
    occurrence." This is somewhat misleading, as the author then follows with
    "the interpretation of the results is difficult" because "... PMI-IR is
    using a much larger data source than LSA." It seems reasonable that, given
    this discussion, the author would have then introduced a more balanced
    experiment, which either scales LSA up to a corpus based on the AltaVista
    database or scales PMI-IR down to a corpus based on an encyclopedia. The
    author does mention this disparity but does so in the context of future
    work: "For future work, it would be interesting to see how LSA performs
    with such a large collection of text." He even goes on to say that "perhaps
    the strength of LSA is that it can achieve relatively good performance with
    relatively little text" and that "it seems likely that PMI-IR achieves high
    performance by 'brute force', through sheer size of the corpus of text that
    is indexed by AltaVista." Perhaps I'm being overly nitpicky here, but I
    believe the points raised above should have been more adequately addressed,
    especially given that the subtitle of this paper is "PMI-IR versus LSA on
    TOEFL." After reading the paper, I am convinced that PMI-IR is a nice way
    of utilizing readily available web interfaces to improve synonym
    recognition (I certainly agree with the notion that you should use data if
    you have access to it), but I am not convinced that PMI-IR is necessarily
    better than LSA (particularly under balanced constraints).

     

    (ii) The author stresses the simplicity of PMI-IR throughout the paper,
    particularly in comparison to LSA. However, it seems this "simplicity" is a
    matter of perspective. Certainly, up front LSA is computationally more
    expensive, as it depends on matrix calculations; however, it's not clear
    which algorithm, LSA or PMI-IR, requires more resources/computational
    infrastructure. For example, PMI-IR relies on the AltaVista web search
    engine for its data source and searching algorithm; this presupposes a
    ready connection to the internet. LSA seems to work much better on smaller
    document collections and therefore may work better in mobile applications,
    for example, where data bandwidth is limited. (The author does touch on
    this a bit with his example hybrid system, which uses a "small, local
    search engine for high-frequency words, but resorts to a large, distant
    search engine for rare words.") This is perhaps not a flaw, but a clear
    functional difference between the two algorithms. It should be noted that I
    am encouraged by/fully support applications that exploit the web for data;
    I believe data mining will be an integral part of many interesting
    applications of the future.

     

    (iii) Overall, I didn't feel this paper was as comprehensive or as well
    written as the last paper we read. In particular, I thought the
    Experiments, Discussion of Results, and Applications sections were rather
    soft. I covered my qualms re: the experiments and discussion in point (i)
    above. The Applications section is an opportunity for the author to "wow"
    the audience with the relevance/"coolness" of their work; none of the
    applications seemed overly interesting.

     

     

    5. Two or Three Important Open Research Questions on Topic and Why They
    Matter

    (i) Questions that came to mind while reading this paper: could PMI-IR be
    used to automatically construct thesauruses (with lists of both synonyms
    and antonyms)? How would incorporating a comprehensive dictionary and
    thesaurus (say, Oxford's) into the text corpus affect either PMI-IR or
    LSA? Would the results improve? What other applications could benefit
    from PMI-IR?
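
    On the first question, I imagine the simplest version would look
    something like the sketch below (purely hypothetical, my own
    illustration, with made-up scores standing in for the hit-count ratio
    sketched under section 2): rank a vocabulary against a headword and keep
    the top-scoring words as a thesaurus entry. It also shows why the antonym
    half of the question is harder, since a bare co-occurrence score does not
    separate antonyms from synonyms.

        # Purely hypothetical sketch (mine, not the paper's): build a thesaurus
        # entry by ranking candidate words against a headword with a
        # PMI-IR-style score and keeping the top few.

        def build_entry(headword, vocabulary, score, top_k=3):
            """Return the top_k vocabulary words most associated with headword."""
            ranked = sorted(vocabulary, key=lambda w: score(headword, w), reverse=True)
            return ranked[:top_k]

        # made-up scores standing in for the hit-count ratio from section 2
        dummy = {("quick", "fast"): 0.9, ("quick", "rapid"): 0.8,
                 ("quick", "slow"): 0.6, ("quick", "purple"): 0.05}
        entry = build_entry("quick", ["fast", "rapid", "slow", "purple"],
                            lambda a, b: dummy.get((a, b), 0.0))
        print(entry)  # -> ['fast', 'rapid', 'slow']  (the antonym sneaks in)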

     

    (ii) Unlike the prior two papers that we've read for class, this paper
    has many suggestions for future work embedded in the discussion of
    results. I thought it might be interesting to write these out together to
    get a better sense of direction. The author's own delineation of future
    work falls into two categories: (1) further evaluation of the LSA/PMI-IR
    algorithms and (2) suggested uses for PMI-IR.

     

    To further evaluate LSA/PMI-IR:

    1. Regarding the possibility of applying LSA to the same corpus used in
    the PMI-IR experiment (i.e. the AltaVista database of 350 million web pages):
    "For future work it would be interesting to see how LSA performs with such
    a large collection of text."
    2. Regarding the testing of the Landauer and Dumais claim that mutual
    information analysis would obtain a score of about 37% on the TOEFL
    questions, given the same source text and chunk size as they used for LSA:
    "Although it appears that they have not tested this conjecture, it seems
    plausible to me. It would be interesting to test this hypothesis. Although
    it might be a challenge to scale LSA up to this volume of text, PMI can
    easily be scaled down to the encyclopedia text that is used by Landauer and
    Dumais. This is another possibility for future work."
    3. Regarding the suggestion that much of the difference between LSA and
    PMI-IR is due to the smaller chunk size of PMI-IR (where a chunk is a
    document or article): "It is interesting that the TOEFL performance for
    score1 (62.5%) is approximately the same as the performance for LSA (64.4%).
    Much of the difference in performance between LSA and PMI-IR comes from
    using the NEAR operator instead of the AND operator. This suggests that
    perhaps much of the difference between LSA and PMI-IR is due to the smaller
    chunk size of PMI-IR (for the scores other than score1). To test this
    hypothesis, the LSA experiment with TOEFL could be repeated using the same
    source text (an encyclopedia), but a smaller chunk size. This is another
    possibility for future work."
    4. Regarding the hypothesis that query expansion achieves essentially
    the same effect as LSI: "The hypothesis implies that LSI will tend to
    perform better than an IR system without query expansion, but there will be
    no significant difference between an IR system with LSI and an IR system
    with query expansions (assuming all factors are equal)."

     

    To build applications using PMI-IR:

    1. Use PMI-IR as a tool to aid in the construction of lexical
    databases.
    2. Use PMI-IR to improve IR systems (i.e. through the use of informed
    query expansion techniques); see the sketch after this list.
    3. Apply PMI-IR to automatic keyword extraction.
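
    To make the query expansion application a bit more concrete, here is a
    rough sketch of what "informed" expansion with PMI-IR might look like
    (again my own illustration, not from the paper; the candidate lists,
    scores, and threshold are made up): each query term is augmented with any
    candidate synonym whose PMI-IR-style score against the term clears a
    threshold.

        # Rough illustration (mine, not the paper's) of PMI-IR-driven query
        # expansion: add a candidate synonym to the query only if its score
        # against the original term is high enough.

        def pmi_ir_score(term, candidate):
            """Made-up scores standing in for hits(term NEAR candidate) / hits(candidate)."""
            dummy = {("car", "automobile"): 0.8, ("car", "banana"): 0.01,
                     ("cheap", "inexpensive"): 0.7, ("cheap", "expensive"): 0.2}
            return dummy.get((term, candidate), 0.0)

        def expand_query(terms, candidates, threshold=0.5):
            """Return the original terms plus any candidate scoring above threshold."""
            expanded = list(terms)
            for term in terms:
                for cand in candidates.get(term, []):
                    if pmi_ir_score(term, cand) >= threshold:
                        expanded.append(cand)
            return expanded

        query = ["cheap", "car"]
        candidates = {"cheap": ["inexpensive", "expensive"],
                      "car": ["automobile", "banana"]}
        print(expand_query(query, candidates))
        # -> ['cheap', 'car', 'inexpensive', 'automobile']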

     

