Reading Review 12-08-2004

From: Craig M Prince (cmprince@cs.washington.edu)
Date: Wed Dec 08 2004 - 08:42:22 PST

    Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL
    By Peter D. Turney

    This paper describes a system for identifying synonyms using
    information retrieval methods: it issues queries to the AltaVista
    search engine and uses the results to measure the co-occurrence between
    words. The resulting system solves the synonym questions from the TOEFL
    and ESL tests with high accuracy.

    One of the biggest contributions of this paper is that it uses the
    internet as a tremendous source of information. By leveraging the power of
    an existing search engine, PMI-IR is able to quickly access the knowledge
    on the internet. In a way, PMI-IR is using the relationships already
    discovered by the AltaVista search engine in order to find synonyms more
    efficiently.

    Another important contribution that I thought was really neat was that the
    author was able to analyze the context in various ESL test questions in
    order to perform initial disambiguation. This was again done using queries
    to the search engine. The simplicity of the given method is pretty amazing
    -- the fact that so much can be gleaned from just the number of hits
    returned from a search engine.
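
    To make the hit-count idea concrete, here is a minimal sketch of how a
    synonym could be picked from hit counts alone. It assumes the simplest
    ratio form of the score (hits for "problem AND choice" divided by hits
    for "choice"); the function names and every hit count below are
    hypothetical and made up for illustration, not taken from the paper or
    from AltaVista.

```python
from typing import Callable, Dict, List

# Hypothetical hit counts, invented purely for illustration. A real system
# would obtain these numbers by sending each query to a search engine.
FAKE_HITS: Dict[str, int] = {
    "large": 800_000,
    "small": 900_000,
    "green": 500_000,
    "big AND large": 40_000,
    "big AND small": 18_000,
    "big AND green": 5_000,
}


def fake_hits(query: str) -> int:
    """Stand-in for a search engine: return the stored count, or 0 if unknown."""
    return FAKE_HITS.get(query, 0)


def cooccurrence_score(problem: str, choice: str,
                       hits: Callable[[str], int]) -> float:
    """Ratio of pages containing both words to pages containing the choice."""
    choice_hits = hits(choice)
    if choice_hits == 0:
        # Avoid dividing by zero when the choice never appears.
        return 0.0
    return hits(f"{problem} AND {choice}") / choice_hits


def best_choice(problem: str, choices: List[str],
                hits: Callable[[str], int]) -> str:
    """Pick the candidate with the highest co-occurrence score."""
    return max(choices, key=lambda c: cooccurrence_score(problem, c, hits))


if __name__ == "__main__":
    print(best_choice("big", ["large", "small", "green"], fake_hits))
```

    Passing the hit-count function in as a parameter keeps the scoring
    logic separate from whatever search engine is actually queried.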

    One concern I have with the work is that it only addresses problems where
    you are given a list of candidates. This seems to be an artifact of the
    "test question" scenario. If there are only four choices then the job
    becomes much easier and lends itself to a co-occurrence analysis. On the
    other hand, this seems to make the method less useful for other problems
    (such as improving internet search using synonyms), where no candidate
    list is given.

    Another issue is that the author doesn't discuss any of the failure
    cases. When the system does fail, why does it fail? Also, does it fail
    on some easy cases, or does it only fail on words that humans would also
    find difficult to disambiguate? Without knowing the types of problems the
    system fails on, it is difficult to know whether there is some inherent
    limitation to the system.

    I think that this paper does a good job of outlining some of the future
    work for determining synonyms; however, are there other problems that
    would benefit from applying IR methods? The internet is a vast store of
    information and common knowledge, and search engines are designed to allow
    us to quickly search this store. How can we use this?

    Another area of future research would be to look at whether there are
    better or additional scores that can be used to improve the results of the
    system. The queries given have some good intuition behind them, but there
    could be additional queries that improve results even further. Could
    looking directly at the content of the pages also be beneficial?

