Mining the web for synonyms - PMI-IR style!

From: Ravi Kiran (kiran@cs.washington.edu)
Date: Wed Dec 08 2004 - 09:30:40 PST

  • Next message: Mathias Ganter: "paper review: Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL"

            Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL

                                 Peter D. Turney

    Summary:

            This paper proposes PMI-IR, an algorithm for recognizing
    synonyms. It issues queries to a web search engine (Information
    Retrieval, IR) and scores each candidate word by its Pointwise Mutual
    Information (PMI) with the target word.

    Two important ideas presented in the paper:

            This paper ties in well with the speculation presented in
    "Proverb: The Proverbial Cruciverbalist" that advances in technology
    can be leveraged to solve problems previously deemed intractable. In this
    case, the problem of finding synonyms, which involves extracting matches
    from a database, was simplified by using a database that is heavily
    optimized for search: the AltaVista web search engine. Using the NEAR
    operator provided by the search engine to implement similarity was a
    particularly appealing idea, as was using the logarithm of the
    co-occurrence probabilities as an information measure for scoring the
    choices for a given query.
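
            The scoring idea above can be sketched roughly as follows. This
    is a minimal illustration, not the paper's code: it assumes the hit
    counts for "problem NEAR choice" and for "choice" alone have already
    been fetched from a search engine, and the function names and all the
    numbers are made up. Since the logarithm is monotonic, ranking by the
    log ratio gives the same order as ranking by the raw ratio.

```python
import math


def pmi_score(hits_joint, hits_choice):
    """Log-ratio score: log2(hits(problem NEAR choice) / hits(choice)).

    The count for the problem word alone is the same for every choice,
    so it is dropped; it would not change the ranking.
    """
    if hits_joint <= 0 or hits_choice <= 0:
        return float("-inf")  # no evidence of co-occurrence
    return math.log2(hits_joint / hits_choice)


def best_choice(hit_counts):
    """Return the choice with the highest score.

    hit_counts maps each choice to a (hits_joint, hits_choice) pair.
    """
    return max(hit_counts, key=lambda c: pmi_score(*hit_counts[c]))


# Hypothetical hit counts for one question; the numbers are invented.
example = {
    "imposed": (120, 40_000),
    "believed": (300, 900_000),
    "requested": (150, 500_000),
    "correlated": (10, 80_000),
}

if __name__ == "__main__":
    for choice, counts in example.items():
        print(f"{choice:12s} {pmi_score(*counts):8.3f}")
    print("best:", best_choice(example))
```

            Note that "believed" has the largest joint count in this made-up
    data, yet it does not win: dividing by the count for the choice alone
    penalizes words that are simply common.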

            As the paper notes, employing the synonym finder for keyword-based
    extraction, particularly for scientific literature, has the potential to
    improve the accuracy of such query systems.

    Two flaws in the paper:

            One thing I found surprising: if the system uses the Web as a
    database, why could it not use an online dictionary/thesaurus, such as
    the one at http://m-w.com (the Merriam-Webster website)? The results
    (particularly the accuracy obtained) are intriguing because, while
    studying for the GRE and TOEFL as an undergraduate (2002), I used a
    simple Perl script to extract synonyms for a given word and got a very
    high percentage of words (around 90%) correct. Using a thesaurus (from
    the same website) decreased the search time and increased accuracy in my
    case.

            The results are presented on a small set of TOEFL/ESL questions.
    Given that TOEFL/ESL tests have been around for some time, the small
    size is surprising. Also, the results are not broken down by question
    difficulty, which varies inherently in examinations such as TOEFL/ESL.
    Failure analysis is distinctly absent.

            

    Future directions for research:

            The algorithm should be expanded to incorporate the importance
    scores of the query results. It is well known that search engines rank
    results based on relevance and importance. This could be introduced,
    possibly as a probabilistic measure. It would also help in finding out
    how effective the querying mechanism itself is.

            The PMI-IR algorithm assumes that the words surrounding a query,
    particularly in the case of a sentence, are statistically independent
    when it performs the context scoring. However, context arises more often
    from a sequence of words ('phrases') than from a single word (viewed in
    this light, the example of tap and maple being contextually related
    seems quite contrived). Therefore, bigram analysis and other n-gram
    analyses would be of immense help in improving the accuracy of the
    results.
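
            As a rough sketch of what the n-gram suggestion might look like,
    the snippet below pulls context phrases (bigrams by default) out of a
    sentence, skipping any phrase that contains the target word. This is my
    own illustration and not part of PMI-IR; the helper names and the sample
    sentence are hypothetical, and the tokenization is a plain whitespace
    split.

```python
def ngrams(tokens, n):
    """Return all runs of n adjacent tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def context_phrases(sentence, target, n=2):
    """Return the n-word phrases in the sentence that exclude the target.

    Each phrase could then be used as one context term in a query,
    in place of a single context word.
    """
    tokens = sentence.lower().split()
    return [" ".join(g) for g in ngrams(tokens, n) if target not in g]


if __name__ == "__main__":
    # Illustrative sentence only.
    print(context_phrases("farmers tap maple syrup", "tap"))
```

            With phrases such as "maple syrup" as context, the score would
    depend on word sequences rather than on isolated words, which is the
    point of the suggestion above.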



    This archive was generated by hypermail 2.1.6 : Wed Dec 08 2004 - 09:30:41 PST