From: Beltran Ibarra Davila-Armero (bida@cs.washington.edu)
Date: Wed Dec 08 2004 - 02:22:47 PST
Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL
By Peter D. Turney
This article describes two techniques for finding synonyms, PMI-IR and LSA,
and discusses their different results on the TOEFL test and the ESL test.
One of the major ideas is that, up to now, programs that looked for
synonyms often used only small databases, or at least none as big as the
Web. The idea of PMI-IR is that by querying the Web one can find
synonyms pretty accurately. Of course, the quality of the result depends
on the quality of the query, but with good queries one can get better
results than by querying a “small” database (up to 10% better).
A cool idea developed in this article is the way it formulates
different ways of querying the search engine in order to get better
results. I like the evolution from the AND operator alone to the
combination of AND, NEAR, NOT and the context. Although he admits
that his program uses brute force, he did try to refine it a little.
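To make the AND-to-NEAR step concrete, here is a minimal sketch of the first two kinds of score. It is not the author's code: the toy page list stands in for the search engine, the function names are made up for illustration, and the ten-word NEAR window is an assumption. Each score divides the number of pages where the problem word and a choice co-occur by the number of pages containing the choice, and the choice with the highest score wins.

```python
# Toy stand-in for a search engine: each string plays the role of one web page.
# The corpus and every function name below are hypothetical.
TOY_PAGES = [
    "the big house was large and roomy",
    "a large dog is a big dog",
    "the small cat sat on the large mat far away from anything big at all today",
    "small and tiny things",
    "big plans",
]

def tokens(page):
    return page.lower().split()

def hits(word, pages=TOY_PAGES):
    """Number of pages containing the word."""
    return sum(1 for p in pages if word in tokens(p))

def hits_and(w1, w2, pages=TOY_PAGES):
    """Number of pages containing both words (the AND query)."""
    return sum(1 for p in pages if w1 in tokens(p) and w2 in tokens(p))

def hits_near(w1, w2, window=10, pages=TOY_PAGES):
    """Number of pages where the words occur within `window` words of each
    other (the NEAR query; the window size is an assumption)."""
    count = 0
    for p in pages:
        t = tokens(p)
        pos1 = [i for i, w in enumerate(t) if w == w1]
        pos2 = [i for i, w in enumerate(t) if w == w2]
        if any(abs(i - j) <= window for i in pos1 for j in pos2):
            count += 1
    return count

def score_and(problem, choice):
    h = hits(choice)
    return hits_and(problem, choice) / h if h else 0.0

def score_near(problem, choice):
    h = hits(choice)
    return hits_near(problem, choice) / h if h else 0.0

def best_choice(problem, choices, score=score_near):
    """Pick the choice with the highest co-occurrence score."""
    return max(choices, key=lambda c: score(problem, c))

print(best_choice("big", ["large", "small", "tiny"]))
```

On this toy data, "small" shares one page with "big" but never within ten words of it, so its score drops from the AND version to the NEAR version, which is the kind of refinement the stricter operator is meant to give.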
One of the flaws of this article is the poor experimentation. I have
taken the TOEFL exam several times (the first time when I was thirteen
years old) and I, as well as many other people, have always considered
it an easy test, especially those questions where you have four
choices (which narrows the search a lot!). I don’t think that testing on
TOEFL was a good idea. Maybe he did it to test his program on the same
basis as the LSA technique, but then he should have reduced the database
to a size comparable to that of the database used by LSA.
Another flaw is that he does not explain what led him to choose those
particular scores and how they improved the results. For example, he
does not explain the improvement brought by score_3 in terms of
avoiding antonyms. Some deeper experimentation on this would
have been welcome.
Also, the experimentation did not seem to be pushed very far. I
mean that 80 TOEFL questions do not really prove that one technique is
superior to another. When one thinks of the number of TOEFL tests
available (I think it is the most common English exam in the world, so I
guess there are millions), 80 seems a very small number.
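To put a number on that worry, here is a rough sketch of a 95% Wilson confidence interval for an accuracy measured on 80 questions. The count of 60 correct answers is made up for illustration and is not a result from the article; the function name is hypothetical too.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half

# Hypothetical example: 60 correct answers out of 80 questions.
low, high = wilson_interval(60, 80)
print(f"{low:.3f} to {high:.3f}")
```

With only 80 questions the interval is more than fifteen percentage points wide, so a difference of around ten points between two techniques is hard to call conclusive.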
Finally, the end of the article lists a lot of future work in
terms of improvements. For an AAAI paper, many of those improvements
could perhaps have been made before publishing it.
I guess that one of the most obvious open questions is what would happen
if there were no choices to pick from. Would this kind of mining still be
effective? That is a field to explore, since when we need synonyms, we do
not always have a list of candidates.
And could this technique be used to find other kinds of related words,
such as antonyms, for example?