From: Martha Mercaldi (mercaldi@cs.washington.edu)
Date: Wed Dec 08 2004 - 11:52:56 PST
Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL
Peter D. Turney
Summary:
This paper presents PMI-IR, an algorithm for answering multiple-choice
synonym questions such as those on the TOEFL.
Primary ideas:
The author states that the primary contribution of this work is the
coupling of existing PMI (pointwise mutual information) techniques with
existing IR (information retrieval) techniques. Several potential PMI
scoring functions are presented and
their subtleties are discussed. One central observation is that
relatively simple scoring functions can capture a surprisingly large
amount of information about word meaning.
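To make the idea concrete, here is a minimal sketch of what I take the
simplest scoring function to be: the number of search hits for the
problem word together with a candidate, divided by the number of hits
for the candidate alone. The function names, the query syntax, and the
hit counts below are all made up for illustration; a real version would
ask a search engine for the counts.

```python
from typing import Callable, Dict, List

# Hypothetical hit counts standing in for a search engine.
# These numbers are invented, not taken from the paper.
FAKE_COUNTS: Dict[str, int] = {
    "imposed": 1000, "levied AND imposed": 50,
    "believed": 5000, "levied AND believed": 10,
    "requested": 4000, "levied AND requested": 8,
    "correlated": 500, "levied AND correlated": 1,
}


def fake_hits(query: str) -> int:
    """Return the (made-up) number of documents matching a query."""
    return FAKE_COUNTS.get(query, 0)


def cooccurrence_score(hits: Callable[[str], int],
                       problem: str, choice: str) -> float:
    """Ratio of joint hits to hits for the choice alone."""
    alone = hits(choice)
    if alone == 0:
        return 0.0  # avoid dividing by zero for unseen words
    return hits(f"{problem} AND {choice}") / alone


def pick_synonym(hits: Callable[[str], int],
                 problem: str, choices: List[str]) -> str:
    """Choose the candidate with the highest co-occurrence score."""
    return max(choices, key=lambda c: cooccurrence_score(hits, problem, c))


best = pick_synonym(fake_hits, "levied",
                    ["imposed", "believed", "requested", "correlated"])
print(best)
```

Dividing by the hits for the candidate alone keeps very common words
from winning simply because they appear on many pages.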
One or two largest flaws:
I did not think that this paper explained the context of the work
clearly enough. I’m not an expert in this area, and I found it
difficult to discern which parts were new algorithms and which parts
were new techniques applied to existing algorithms. I gather (perhaps
incorrectly) that PMI was an existing technique, that the three scoring
functions were newly developed (if so, what scoring function had been
used with PMI in the past?), and that coupling PMI with IR was the
primary contribution of this paper.
My other complaint is a scientific one. Whatever search engine is used
(AltaVista here) might apply its own search and correlation algorithms
under the covers. Perhaps I think this only because I live in the age
of Google, and AltaVista was in fact much more primitive. Still, it
seems appropriate to at least mention how AltaVista determined something
such as “nearness” if those algorithms are to be incorporated.
Otherwise it is hard to tell whether the performance improvement when
going from score1() to score2() is due to the scoring function or to
some behavior internal to the search engine.
Open research questions:
One interesting question that I do not think was fully addressed in
this paper was the synergy between the PMI and the IR algorithms
used. With the great strides made in IR in the past 5 years, revisiting
this work might reveal interesting improvements in performance.
The author cites automated extraction of keywords as his ultimate
goal. Personally, I feel that as far as scientific literature goes,
authors generally annotate their work with keywords already, and the
relatively small amount of literature does not provide much motivation
for automation. However, the idea of a browser annotating webpages with
keywords could be helpful for a user. The massive number of pages on
the web surely motivates automation of the process.