From: Kevin Wampler (wampler@cs.washington.edu)
Date: Wed Dec 08 2004 - 11:51:53 PST
In "Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL" the author,
Peter Turney, describes a simple algorithm based on a pointwise mutual
information measure of Internet search results which outperforms LSA in
finding synonyms on the TOEFL test.
The primary strengths of this method are its success and its
simplicity. Provided that one can give queries to a search engine with
the operators NEAR, AND, and OR, and get the number of results found
(not a hard task), the implementation of the algorithm is quite
easy. Given this, the score of 74% on the TOEFL test seems even more
impressive next to LSA's performance of 64.4%. This provides an
illustration that a relatively simple probability model, appropriately
chosen, can often perform quite well in practice, and that relying on
results produced from a huge body of data can be a very effective way to
attack very hard AI problems.
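To make the point about simplicity concrete, here is a minimal sketch of
the scoring step in Python. It assumes the score for a candidate word is
a co-occurrence ratio of the form hits(problem NEAR choice) /
hits(choice), which is my reading of the simplest NEAR-based variant and
not necessarily the exact formula used for the reported results. The
names pmi_ir_score, best_synonym, and toy_hits are my own, and the hit
counts are invented for illustration; a real version would replace
toy_hits with a function that queries a search engine.

```python
from typing import Callable, Dict, List

# Hypothetical interface: takes a query string and returns the number of
# documents the search engine reports for it.
HitCounter = Callable[[str], int]


def pmi_ir_score(problem: str, choice: str, hits: HitCounter) -> float:
    """Assumed score: hits(problem NEAR choice) / hits(choice)."""
    choice_hits = hits(choice)
    if choice_hits == 0:
        # No evidence at all for this choice, so give it the lowest score.
        return 0.0
    return hits(f"{problem} NEAR {choice}") / choice_hits


def best_synonym(problem: str, choices: List[str], hits: HitCounter) -> str:
    """Return the choice with the highest score for the problem word."""
    return max(choices, key=lambda c: pmi_ir_score(problem, c, hits))


# Invented hit counts standing in for a search engine (not real data).
TOY_COUNTS: Dict[str, int] = {
    "imposed": 50_000,
    "believed": 400_000,
    "requested": 300_000,
    "correlated": 20_000,
    "levied NEAR imposed": 2_000,
    "levied NEAR believed": 400,
    "levied NEAR requested": 600,
    "levied NEAR correlated": 10,
}


def toy_hits(query: str) -> int:
    """Look up a query in the toy table; unknown queries count as zero."""
    return TOY_COUNTS.get(query, 0)


if __name__ == "__main__":
    options = ["imposed", "believed", "requested", "correlated"]
    print(best_synonym("levied", options, toy_hits))
```

Dividing by hits(choice) keeps very common words from winning simply
because they appear near everything. Note that the whole method rests on
the hit counter being trustworthy.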
There are, however, some rather important issues which arise with this
method. Because the method relies on the number of results returned by
Altavista, the algorithm may appear better (and simpler) than it really
is, since it implicitly uses Altavista's own search algorithms. If
Altavista (presumably for speed) does not retrieve all documents which
match a query, but attempts to select the more relevant documents, it's
probable that the results of the PMI-IR algorithm rely largely on
Altavista's ranking algorithms. This likely makes the actual algorithm
being used much more complicated (although much of it is hidden in the
black box that is the Altavista search engine, which is possibly a very
desirable trait).
Synonym matching seems to be a good example of a rather difficult
natural language processing problem for which searching the web can be a
very powerful tool. It would be interesting to see if the same technique
of searching the web to solve hard AI tasks, by finding patterns in the
huge bulk of data, can effectively tackle more complex sorts of problems.
Actually, I'm sure that there's a lot of research on this exact question;
I'm just not aware of the results.