From: Jiun-Hung Chen (jhchen@cs.washington.edu)
Date: Wed Dec 08 2004 - 01:04:57 PST
Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL
Peter D. Turney
Review by Jiun-Hung Chen
1. Summary
This paper proposes PMI-IR, an unsupervised learning algorithm that uses Pointwise Mutual Information (PMI)
and Information Retrieval (IR) to recognize synonyms.
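To make the summary concrete, here is a minimal sketch of the simplest scoring idea as I understand it: each candidate synonym is scored by the ratio of hits for the problem word together with the candidate to hits for the candidate alone, and the highest-scoring candidate wins. The function name `pmi_ir_choice`, the `hits` callback, the `AND` query syntax and all hit counts below are assumptions made up for illustration; they are not taken from the paper's experiments or from any real search engine API.

```python
from typing import Callable, Sequence


def pmi_ir_choice(problem: str,
                  choices: Sequence[str],
                  hits: Callable[[str], int]) -> str:
    """Pick the choice that co-occurs most strongly with the problem word.

    `hits` is a hypothetical callback returning the number of documents
    that match a query; a real system would ask a search engine.
    """
    def score(choice: str) -> float:
        alone = hits(choice)
        if alone == 0:
            # No evidence for this choice at all; rank it last.
            return 0.0
        # hits(problem AND choice) / hits(choice)
        return hits(f"{problem} AND {choice}") / alone

    return max(choices, key=score)


# Made-up hit counts standing in for search engine replies.
FAKE_HITS = {
    "large": 2000,
    "cold": 4000,
    "big AND large": 400,
    "big AND cold": 100,
}


def fake_hits(query: str) -> int:
    return FAKE_HITS.get(query, 0)


best = pmi_ir_choice("big", ["large", "cold"], fake_hits)
print(best)
```

Passing the hit counter in as a callback keeps the scoring logic separate from whichever search engine supplies the counts.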
2. Most important ideas
The most important idea in this paper is to perform a task by issuing queries to a search engine and
analyzing the replies. I think people use this idea all the time and it works very well.
For example, suppose you want to eat Japanese food but don't know where to go. You might send a query
like "good Japanese restaurant in Seattle" to Google, analyze the replies and then decide on a restaurant.
The key points are that the WWW is a huge database and that search engines can provide very useful and
reliable replies to queries. Formulating synonym learning as an unsupervised learning problem and solving
it by analyzing co-occurrence is difficult and interesting. I believe that the success can be ascribed to
the key insight that a word is characterized by the company it keeps. In contrast, a supervised learning
approach to synonyms seems intuitive and trivial.
3. Largest flaws
The largest flaw is that the author exaggerates the comparisons in the abstract,
although he does mention that the comparisons between PMI-IR and LSA are biased
because the experiments were not done under the same conditions. I think a fair
comparison under the same conditions is still missing. The other flaw is that hit counts
may be good estimates of probabilities, but the author does not verify this point.
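The assumption in question can be written out explicitly. In the sketch below, the symbols $w$ (problem word), $c$ (candidate choice) and $N$ (total number of indexed documents) are my own notation, not the paper's. If probabilities are estimated by hit counts, then for a fixed $w$ ranking candidates by PMI reduces to ranking them by a simple ratio of hits, because the logarithm is monotonic and $N / \mathrm{hits}(w)$ is the same for every candidate:

```latex
\mathrm{PMI}(w, c)
  = \log_2 \frac{p(w \wedge c)}{p(w)\, p(c)},
\qquad
p(q) \approx \frac{\mathrm{hits}(q)}{N}
\;\Longrightarrow\;
\mathrm{PMI}(w, c)
  \approx \log_2 \left(
      \frac{N}{\mathrm{hits}(w)} \cdot
      \frac{\mathrm{hits}(w \wedge c)}{\mathrm{hits}(c)}
    \right)
```

The whole argument rests on the step $p(q) \approx \mathrm{hits}(q)/N$, which is the part that goes unverified.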
4. Open research questions
Extending this work to finding relations in sentences, paragraphs or documents by mining the web could be very
important and useful for natural language processing and understanding. Furthermore, mining the web for
visual information such as images and movies is challenging because no obvious structural information,
such as a grammar, is available.