(no subject)

From: Pravin Bhat (pravinb@u.washington.edu)
Date: Wed Dec 08 2004 - 05:44:09 PST

  • Next message: Craig M Prince: "Reading Review 12-08-2004"

    Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL
    Peter D. Turney

    Paper Summary: The paper presents an unsupervised learning algorithm
    that solves the multiple choice synonym matching problem by combining
    results from multiple web queries.

    Paper Strengths:
    The author applies a mathematical analysis tool, Pointwise Mutual
    Information, to increase the precision of a high-recall
    Information Retrieval method, namely web queries. The author is able
    to demonstrate that the fairly simple and intuitive PMI approach tends
    to perform just as well as LSA, a mathematically stronger but
    computationally expensive method, when used on large databases.

    The paper is ingenious in that it presents a technique to leverage existing
    low-cost technologies as research tools. The two main technologies leveraged
    are the internet, which is probably the largest distributed database of its
    kind, and sophisticated search engines, which employ the latest techniques
    in the Information Retrieval field to remain competitive. By building on
    all the research that has already gone into these technologies, the author
    was able to keep his side of the implementation to a bare minimum.

    Paper Flaws:
    I fail to understand why the author stopped at "not" as his antonym filter.
    There are several modifiers in the English language which imply an
    antonym: "vs" (David vs Goliath), "as opposed to", "instead of", etc.
    The author could have produced an exhaustive list of such modifiers
    and then reduced the list to the N most frequently used modifiers. The
    frequency of each modifier could have been estimated using a web query.
    Increasing N would have increased the algorithm's precision at the cost of runtime.
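
    The reduction step suggested above can be sketched as follows. This is
    only an illustration of the reviewer's proposal, not anything from the
    paper: the function name top_modifiers and the hit counts are
    hypothetical, and real counts would have to come from web queries.

```python
def top_modifiers(hit_counts, n):
    """Return the n modifiers with the highest estimated web frequency."""
    ranked = sorted(hit_counts.items(), key=lambda kv: kv[1], reverse=True)
    return [modifier for modifier, _ in ranked[:n]]


# Invented counts for illustration only; real values would come from web queries.
example_counts = {
    "not": 9000,
    "instead of": 700,
    "vs": 1200,
    "as opposed to": 300,
}

print(top_modifiers(example_counts, 2))
```

    A larger n means more filter queries per candidate word, which is the
    precision versus runtime trade-off mentioned above.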

    What the algorithm actually calculates is the co-occurrence of word pairs, which
    might be a reasonable approximation of the synonym relation at
    the TOEFL level. However, high co-occurrence is not always due
    to synonymy, and often the ambiguity cannot be resolved simply by an
    antonym filter. For example, the method is likely to match "lord" as a
    synonym of "rings" simply because "lord of the rings" is a popular
    sequence of words on the internet. Similarly, the method is also prone
    to errors/biases in the search engine. For example, not too long ago Google
    was bombed into associating "dismal failure" with "George Bush", which
    would have thrown off the algorithm (OK, bad example).
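
    A minimal sketch of the co-occurrence problem, assuming a score of the
    form hits(problem, choice) / hits(choice). Ranking by this ratio gives
    the same order as ranking by PMI, since the frequency of the problem
    word is the same for every choice. The function names and all hit
    counts below are invented for illustration; they are not measurements.

```python
def cooccurrence_score(hits_together, hits_choice):
    """Ratio of joint hits to the choice's own hits; 0.0 if the choice never appears."""
    if hits_choice == 0:
        return 0.0
    return hits_together / hits_choice


def best_choice(joint_hits, choice_hits):
    """Pick the choice with the highest co-occurrence score."""
    return max(
        joint_hits,
        key=lambda c: cooccurrence_score(joint_hits[c], choice_hits[c]),
    )


# Invented hit counts for the problem word "lord" (illustration only).
joint_hits = {"rings": 800, "master": 150, "servant": 20}
choice_hits = {"rings": 2000, "master": 3000, "servant": 1000}

print(best_choice(joint_hits, choice_hits))
```

    With these made-up numbers the fixed phrase wins over the plausible
    synonym "master", and no antonym filter would change that.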

    Future work:
    - Comparing LSA and PMI-IR on databases of the same size. This would
    help us better understand exactly how much data is required to get away
    with using PMI-IR.

    - More work could be done to semantically analyze the search results from
    the NEAR option. This way the algorithm would be able to use the rules of
    English grammar to filter out nonsensical co-occurrences like
    ".. was small. Big day for..."

    - Another way to filter junk from the search results would be to limit the search
    to credible sources: online encyclopedias, dictionaries, journals, newspapers, etc.
    Google has an option that lets you limit your searches to particular websites or domains.
    Similarly, accuracy can be improved by searching in specialized domains for
    domain-specific terms. For example, if we knew our query word was a math term
    then we could limit our search to Wolfram.com.
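
    Two of the filtering ideas above can be sketched roughly as follows.
    Both helpers (same_sentence and restrict_to_site) are hypothetical
    names, and the sentence splitter is deliberately naive: it breaks on
    ".", "!" and "?" only, so abbreviations would confuse it. The second
    helper assumes the search engine accepts a "site:" restriction in the
    query string, as Google does.

```python
import re


def same_sentence(snippet, word_a, word_b):
    """True if both words occur in at least one sentence of the snippet."""
    sentences = re.split(r"[.!?]+", snippet.lower())
    for sentence in sentences:
        tokens = set(re.findall(r"[a-z']+", sentence))
        if word_a.lower() in tokens and word_b.lower() in tokens:
            return True
    return False


def restrict_to_site(query, domain):
    """Append a site: restriction to a query string."""
    return f"{query} site:{domain}"


# The pair straddles a sentence boundary, so the snippet would be discarded.
print(same_sentence("The room was small. Big day for the team.", "small", "big"))

# Hypothetical math-term query limited to one domain.
print(restrict_to_site('"integral" "derivative"', "wolfram.com"))
```

    A real grammar-aware filter would need a proper sentence segmenter or
    parser; this only shows where such a check would sit in the pipeline.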


