From: Seth Cooper (scooper@cs.washington.edu)
Date: Wed Dec 08 2004 - 08:58:29 PST
The paper is titled "Mining the Web for Synonyms: PMI-IR versus LSA
on TOEFL" by Turney. It discusses using Pointwise Mutual Information
(PMI) and Information Retrieval (IR) to guess synonyms and compares this
approach to a previous algorithm, Latent Semantic Analysis (LSA).
The Internet is a huge corpus of data for a machine to learn from.
However, most of it is not marked in a way that is useful for supervised
learning. It could be very useful, however, for an algorithm that could
use it to perform unsupervised learning. One of the strengths of this
algorithm is that it is able to learn from a large corpus of data
without needing a human to guide it or give it feedback, and it still
does reasonably well. The algorithm is able to analyze and come up with
useful responses for test questions from the TOEFL, even to the point
where it does a bit better than some of the humans who take the test,
and, given that each question takes about sixteen seconds, probably
faster than some humans. Another strength of the paper lies in the fact
that it leverages the existing technology provided by AltaVista. This
allows the actual algorithm itself to be fairly concise, elegant, and
efficient.
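To make the scoring idea concrete, here is a minimal Python sketch of
what I understand to be the simplest form of the PMI-IR score: for each
answer choice, the number of pages matching both the problem word and
the choice, divided by the number of pages matching the choice alone.
The names (pmi_ir_choose, fake_hits), the query string format, and all
hit counts below are invented for illustration. A real run would send
the queries to a search engine instead of a lookup table.

```python
from typing import Callable, Dict, List

# Made-up hit counts standing in for search engine results (assumption).
FAKE_HITS: Dict[str, int] = {
    "imposed": 1000, "levied AND imposed": 50,
    "believed": 5000, "levied AND believed": 10,
    "requested": 4000, "levied AND requested": 8,
    "correlated": 500, "levied AND correlated": 1,
}


def fake_hits(query: str) -> int:
    """Hypothetical stand-in for a search engine hit count."""
    return FAKE_HITS.get(query, 0)


def pmi_ir_choose(problem: str, choices: List[str],
                  hits: Callable[[str], int]) -> str:
    """Pick the choice that co-occurs most with the problem word,
    relative to how common the choice is on its own."""
    def score(choice: str) -> float:
        alone = hits(choice)
        if alone == 0:
            return 0.0  # no evidence at all for this choice
        return hits(f"{problem} AND {choice}") / alone

    return max(choices, key=score)


best = pmi_ir_choose(
    "levied", ["imposed", "believed", "requested", "correlated"], fake_hits
)
print(best)
```

Since the ranking only needs two hit counts per choice, most of the work
is left to the search engine, which is where the conciseness comes from.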
A weakness of this paper comes up in the comparison with LSA. Although
PMI-IR does significantly better than LSA on the TOEFL, it had a much
larger data set to draw information from. Although it might take a
while to crunch the SVD, it would be a fairer comparison to test against
an LSA that had used a corpus of equal size. Also, although the paper
mentions the average score of students from non-English speaking
countries, it doesn’t mention the overall average or that of students
from English speaking countries, which are presumably higher. It would
be interesting for comparison to see how PMI-IR actually does against an
average native English speaker, to see if getting about 75% correct is
acceptable.
One open question from the paper is: can similar unsupervised learning
algorithms be developed for different purposes? Given that we have
access to the vast amount of data on the Internet, is it possible to
answer other questions beyond what the synonym of a word is? Would it
be possible to take advantage of specialized information on the
Internet, rather than just looking at it as a bunch of words?
Another interesting idea to apply to this would be something along the
lines of ensembles of classifiers, using various search engines. The
paper only used AltaVista, but there are other alternatives, which might
give different scores to the choices. Queries could be run on several
search engines, and the engines could then vote on which choice is
best. This might also relate to the PROVERB paper, because the results
from the various search engines would have to be combined in some
meaningful way, just like the results from the various expert modules.
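As a rough sketch of the voting idea, assuming each search engine has
already produced a score for every answer choice, a simple majority vote
might look like the following. The engine names, the scores, and the
function name (vote) are all hypothetical. Majority voting is only one
way to combine the results and is not taken from either paper.

```python
from collections import Counter
from typing import Dict


def vote(per_engine_scores: Dict[str, Dict[str, float]]) -> str:
    """Each engine casts one ballot for its top-scoring choice;
    the choice with the most ballots wins."""
    ballots = [
        max(scores, key=scores.get)
        for scores in per_engine_scores.values()
    ]
    return Counter(ballots).most_common(1)[0][0]


# Invented scores from three hypothetical engines.
scores_by_engine = {
    "engine_a": {"imposed": 0.050, "believed": 0.002, "requested": 0.002},
    "engine_b": {"imposed": 0.030, "believed": 0.004, "requested": 0.010},
    "engine_c": {"imposed": 0.010, "believed": 0.001, "requested": 0.020},
}

print(vote(scores_by_engine))
```

A plain vote throws away how confident each engine was. Combining the
raw scores with per-engine weights would be closer in spirit to merging
expert modules, at the cost of having to choose the weights.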