From: Craig M Prince (cmprince@cs.washington.edu)
Date: Wed Dec 08 2004 - 08:42:22 PST
Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL
By Peter D. Turney
This paper describes a system for identifying synonyms by using
information retrieval methods (namely, the hit counts returned by the
AltaVista search engine) to measure the co-occurrence between words.
The resulting system solves the synonym questions on the TOEFL and ESL
tests with high accuracy.
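To make the idea concrete, here is a minimal sketch of hit-count scoring. It
assumes the simplest kind of ratio (hits for the problem word together with a
choice, divided by hits for the choice alone) and is not meant as the paper's
exact formula. The names pmi_ir_score, pick_synonym and fake_hits, the "AND"
query string, and all of the counts are hypothetical; a real system would send
the queries to a search engine instead of a lookup table.

```python
from typing import Callable, Sequence

# Made-up hit counts standing in for a real search engine (assumption).
FAKE_HITS = {
    "large": 800, "green": 900, "fast": 700, "sad": 500,
    "big AND large": 400, "big AND green": 90,
    "big AND fast": 70, "big AND sad": 25,
}


def fake_hits(query: str) -> int:
    """Return the made-up hit count for a query, or 0 if it is unknown."""
    return FAKE_HITS.get(query, 0)


def pmi_ir_score(problem: str, choice: str, hits: Callable[[str], int]) -> float:
    """Joint hits divided by the choice's own hits (0.0 if the choice has none)."""
    joint = hits(f"{problem} AND {choice}")
    alone = hits(choice)
    return joint / alone if alone else 0.0


def pick_synonym(problem: str, choices: Sequence[str],
                 hits: Callable[[str], int]) -> str:
    """Return the candidate with the highest score for the problem word."""
    return max(choices, key=lambda c: pmi_ir_score(problem, c, hits))


best = pick_synonym("big", ["large", "green", "fast", "sad"], fake_hits)
```

With these made-up counts, "large" scores 400/800 while the other candidates
score 0.1 or less, so it is the one selected.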
One of the biggest contributions of this paper is that it treats the
internet as a tremendous source of information. By leveraging an
existing search engine, PMI-IR can quickly access the knowledge on the
internet. In a way, PMI-IR is using the relationships already discovered
by the AltaVista search engine in order to find synonyms more
efficiently.
Another important contribution that I thought was really neat was that the
author was able to analyze the context in various ESL test questions in
order to perform initial disambiguation. This was again done using queries
to the search engine. The simplicity of the method is pretty amazing:
so much can be gleaned from just the number of hits returned by a search
engine.
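As a rough illustration of how context could feed into the same kind of
hit-count ratio, the sketch below adds a context word to both queries. This is
an assumption made for illustration, not the paper's exact scoring function;
context_score, pick_in_context, fake_hits, the "AND" query strings, and every
count are hypothetical.

```python
from typing import Callable, Sequence

# Made-up hit counts standing in for a real search engine (assumption).
FAKE_HITS = {
    "bank AND shore AND river": 60, "shore AND river": 200,
    "bank AND lender AND river": 2, "lender AND river": 40,
    "bank AND shore AND loan": 1, "shore AND loan": 50,
    "bank AND lender AND loan": 90, "lender AND loan": 150,
}


def fake_hits(query: str) -> int:
    """Return the made-up hit count for a query, or 0 if it is unknown."""
    return FAKE_HITS.get(query, 0)


def context_score(problem: str, choice: str, context: str,
                  hits: Callable[[str], int]) -> float:
    """Hit-count ratio with both queries restricted to pages with the context word."""
    joint = hits(f"{problem} AND {choice} AND {context}")
    alone = hits(f"{choice} AND {context}")
    return joint / alone if alone else 0.0


def pick_in_context(problem: str, choices: Sequence[str], context: str,
                    hits: Callable[[str], int]) -> str:
    """Return the candidate with the highest context-restricted score."""
    return max(choices, key=lambda c: context_score(problem, c, context, hits))


by_river = pick_in_context("bank", ["shore", "lender"], "river", fake_hits)
by_loan = pick_in_context("bank", ["shore", "lender"], "loan", fake_hits)
```

With these made-up counts the same problem word gets a different answer
depending on the context word: "shore" next to "river" and "lender" next to
"loan".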
One concern I have with the work is that it only handles problems where you
are given a list of candidates. This seems to be an artifact of the
"test question" scenario. With only four choices to rank, the job becomes
much easier and lends itself to a co-occurrence analysis. On the other
hand, this makes the method less useful for other problems (such as
improving internet search using synonyms), where no candidate list is
supplied.
Another issue is that the author doesn't discuss any of the failure
cases. When the system does fail, why does it fail? Also, does it fail
on some easy cases, or does it only fail on words that humans would also
find difficult to disambiguate? Without knowing the types of problems the
system fails on, it is difficult to know whether there is some inherent
limitation to the system.
I think that this paper does a good job of outlining some of the future
work for determining synonyms; however, are there other problems that
would benefit from applying IR methods? The internet is a vast store of
information and common knowledge, and search engines are designed to let
us quickly search this store. How can we use this?
Another area of future research would be to look at whether there are
better or additional scores that can be used to improve the results of
the system. The queries given have some good intuition behind them, but
there could be additional queries that improve results even further.
Could looking at the content of the pages directly also be beneficial?