Paper review 3 (Jonas Klink)

From: jklink@u.washington.edu
Date: Wed Dec 08 2004 - 01:20:44 PST


    Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL
    Peter D. Turney

    One-line summary
    The paper presents an unsupervised synonym-finding algorithm, PMI-IR, which combines Pointwise Mutual Information (PMI) with Information Retrieval (IR) over a huge text resource: the Web.

    Main ideas
    By combining the vastness of the Web and the indexing work already done by search engines (such as AltaVista) with a simple algorithm (PMI-IR) for scoring word co-occurrence, the paper argues that a machine can achieve a good score on the synonym sections of the TOEFL and ESL tests. The complete PMI-IR algorithm evaluates four different scoring methods (search criteria), each of which issues web queries and works from the resulting hits.

    A similar query is then performed to establish the number of documents in which each alternative word (the candidate synonym) occurs NEAR the query word, where nearness is defined as occurring within ten words of the query word. The final score is then calculated as the conditional probability of the query word given the synonym alternative. PMI-IR correctly finds 74% of the synonyms, compared to 64% for LSA (Latent Semantic Analysis).
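
    As a rough illustration of the NEAR-based score described above, here is a minimal Python sketch (not the author's Perl program, which the paper does not show). It estimates the conditional probability of the query word given an alternative as hits(query NEAR alternative) / hits(alternative) and picks the highest-scoring alternative. The function names, the query-string format, and the hit counts are all assumptions made up for illustration; a real run would need a search engine that supports a NEAR operator.

```python
from typing import Callable, Sequence


def near_score(problem: str, choice: str, hits: Callable[[str], int]) -> float:
    """Estimate p(problem | choice) as hits(problem NEAR choice) / hits(choice)."""
    denominator = hits(choice)
    if denominator == 0:
        # No documents contain the alternative at all, so there is no evidence.
        return 0.0
    return hits(f"{problem} NEAR {choice}") / denominator


def pick_synonym(problem: str, choices: Sequence[str],
                 hits: Callable[[str], int]) -> str:
    """Return the alternative with the highest NEAR-based score."""
    return max(choices, key=lambda choice: near_score(problem, choice, hits))


# Hypothetical hit counts, invented purely for illustration.
# They are NOT taken from the paper or from any real search engine.
TOY_COUNTS = {
    "imposed": 2000,
    "believed": 5000,
    "requested": 4000,
    "correlated": 1000,
    "levied NEAR imposed": 200,
    "levied NEAR believed": 50,
    "levied NEAR requested": 40,
    "levied NEAR correlated": 5,
}


def toy_hits(query: str) -> int:
    """Stand-in for a web search hit count; unknown queries return 0."""
    return TOY_COUNTS.get(query, 0)


if __name__ == "__main__":
    alternatives = ["imposed", "believed", "requested", "correlated"]
    print(pick_synonym("levied", alternatives, toy_hits))
```

    Passing the hit-count lookup in as a function keeps the scoring logic separate from whatever search backend is used, so the same code can be exercised with a toy table like the one here.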

    Flaws
    To me, the biggest flaw in this article is the lack of thorough testing. The results on the 80 TOEFL questions and the 50 ESL questions are both interesting and impressive in themselves (given the simple program the author claims to have used), but for me they are not nearly enough to establish the superiority the author claims for PMI-IR over LSA. Additional tests are certainly needed, as is a comparison showing how LSA fares on the ESL test (with context).

    In the Related Work section, the author also remarks that hand-encoded synonym resources are less prone to mistakes. This argument overlooks the fact that the large amount of manually entered data in these lexicons is bound to contain at least some errors (due to the human factor), so such lexicons do not by definition perform better than machine-learning systems.

    The article lacks a more detailed description of the short Perl program used. It would be interesting to see, from a discussion of the source code, whether the program can be optimized and whether it can easily be extended to other applications.

    Open questions and improvements
    One of the big appeals of the project presented in this paper was, for me, its use of the abundance of information available through web searches. I think we have seen far too few products (in any field) that make good use of this quickly and easily accessible data, and I certainly think and hope more will follow the good example set by this paper.

    It would be interesting to see how LSA would fare on a context-based test (such as the ESL) and whether these algorithms, combined with context reasoning, could help solve problems in, e.g., Natural Language Processing applications.


