From: Adrienne Wang (axwang@cs.washington.edu)
Date: Wed Dec 08 2004 - 11:40:20 PST
Title: Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL
Author: Peter D. Turney
Summary: PMI-IR is a simple unsupervised learning algorithm for
recognizing synonyms. It uses Pointwise Mutual Information (PMI)
and Information Retrieval (IR) to measure the similarity of pairs
of words.
Important ideas: 1. Pointwise Mutual Information gives a score
that estimates the semantic similarity between two words. The
scores account not only for synonyms but also for antonyms and
context-dependent words. The LSA paper claimed that MI analysis
would give similar accuracy, but PMI actually turns out to be
better. 2. Using the Web as the data source. The usual difficulty
for semantic similarity measures is the sparseness of the data,
but the Web provides a good and huge data source. In their
project, they use the AltaVista search engine.
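To make idea 1 concrete, here is a minimal sketch of the simplest
co-occurrence form of the score, assuming it is computed as
hits(problem AND choice) / hits(choice). The names pmi_ir_score,
best_choice and fake_hits are hypothetical, and the hit counts are
made up for illustration; they are not real search engine results.

```python
from typing import Callable, Dict, List

# Hypothetical hit counts standing in for search engine queries.
# These numbers are invented for illustration only.
FAKE_COUNTS: Dict[str, int] = {
    "imposed": 1000,
    "believed": 5000,
    "requested": 4000,
    "correlated": 800,
    "levied AND imposed": 120,
    "levied AND believed": 50,
    "levied AND requested": 40,
    "levied AND correlated": 4,
}


def fake_hits(query: str) -> int:
    """Return the made-up hit count for a query (0 if unknown)."""
    return FAKE_COUNTS.get(query, 0)


def pmi_ir_score(problem: str, choice: str,
                 hits: Callable[[str], int]) -> float:
    """Co-occurrence score: hits(problem AND choice) / hits(choice).

    hits(problem) is the same for every choice, so it is dropped;
    it does not change the ranking of the choices.
    """
    denom = hits(choice)
    if denom == 0:
        return 0.0  # no data for this choice, treat as unrelated
    return hits(f"{problem} AND {choice}") / denom


def best_choice(problem: str, choices: List[str],
                hits: Callable[[str], int]) -> str:
    """Pick the choice with the highest score for the problem word."""
    return max(choices, key=lambda c: pmi_ir_score(problem, c, hits))


if __name__ == "__main__":
    choices = ["imposed", "believed", "requested", "correlated"]
    print(best_choice("levied", choices, fake_hits))
```

Passing hits in as a function keeps the scoring separate from the
data source, so the same code could be pointed at a local corpus
index instead of a Web search engine.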
Weak points: The two methods use different databases, so the
results do not seem comparable to each other, especially since
some researchers have pointed out that LSA would perform better
than PMI if given the same database. So the good performance of
PMI is probably due to the huge amount of data on the Web, and
perhaps even to the search engine itself.
Possible research directions: 1. Scale LSA up to a Web-size
database and then test the two methods. 2. Run PMI-IR on an
encyclopedia-scale text database and compare the performance of
the two methods.