PMI-IR review

From: Kevin Wampler (wampler@cs.washington.edu)
Date: Wed Dec 08 2004 - 11:51:53 PST

  • Next message: Martha Mercaldi: "Review #3"

    In "Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL" the author,
    Peter Turney, describes a simple algorithm, based on a pointwise mutual
    information measure over Internet search results, which outperforms LSA
    at finding synonyms on the TOEFL test.

    The primary strengths of this method are its success and its
    simplicity. Provided that one can send a search engine queries using
    the operators NEAR, AND, and OR and read back the number of results
    found (not a hard task), the algorithm is quite easy to implement.
    That simplicity makes its score of 74% on the TOEFL test, against
    LSA's 64.4%, all the more impressive. It illustrates that a relatively
    simple probability model, appropriately chosen, can often perform
    quite well in practice, and that relying on results drawn from a huge
    body of data can be a very effective way to attack very hard AI
    problems.
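
    To illustrate how little code the scoring step needs, here is a
    minimal Python sketch. It assumes the score for each candidate is the
    hit count for "problem NEAR choice" divided by the hit count for the
    choice alone, which is one of the scoring variants in the paper. The
    hits() callback, the query string format, and the toy counts are all
    made up for illustration; they are not Altavista's real interface or
    real data.

```python
from typing import Callable, Sequence


def pmi_ir_choice(problem: str,
                  choices: Sequence[str],
                  hits: Callable[[str], int]) -> str:
    """Return the choice with the highest co-occurrence score.

    `hits` is a hypothetical callback that maps a query string to the
    number of results a search engine reports for it.
    """
    def score(choice: str) -> float:
        alone = hits(choice)
        if alone == 0:
            # A choice the engine never sees cannot be scored.
            return 0.0
        # Assumed score: hits(problem NEAR choice) / hits(choice).
        return hits(f"{problem} NEAR {choice}") / alone

    return max(choices, key=score)


# Toy hit counts, invented purely so the sketch can run offline.
TOY_HITS = {
    "large": 1000,
    "small": 1200,
    "green": 800,
    "big NEAR large": 50,
    "big NEAR small": 30,
    "big NEAR green": 4,
}


def toy_hits(query: str) -> int:
    """Stand-in for a real search engine; unknown queries get 0 hits."""
    return TOY_HITS.get(query, 0)


best = pmi_ir_choice("big", ["large", "small", "green"], toy_hits)
```

    Passing the hit counter in as a function keeps the scoring logic
    separate from whatever search engine actually supplies the counts.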

    There are, however, some rather important issues which arise with this
    method. Because it relies on the number of results returned by
    Altavista, the algorithm may appear better (and simpler) than it really
    is, since it implicitly uses Altavista's own search algorithms. If
    Altavista (presumably for speed) does not generate every document which
    matches a query, but instead attempts to select the more relevant
    documents, then the results of the PMI-IR algorithm probably depend
    largely on Altavista's ranking algorithms. This likely makes the actual
    algorithm being used much more complicated (although much of it is
    hidden in the black box that is the Altavista search engine, possibly a
    very desirable trait).

    Synonym matching seems to be a good example of a rather difficult
    natural language processing problem for which searching the web can be
    a very powerful tool. It would be interesting to see whether the same
    technique of searching the web to solve hard AI tasks, by finding
    patterns in the huge bulk of data, can effectively tackle more complex
    sorts of problems. Actually, I'm sure that there's a lot of research on
    this exact question -- I'm just not aware of the results.


