From: Mathias Ganter (mganter@u.washington.edu)
Date: Wed Dec 08 2004 - 09:34:28 PST
Authors and Title
Turney, Peter (2001) Mining the Web for Synonyms: PMI-IR versus LSA on
TOEFL. In De Raedt, Luc and Flach, Peter, Eds. Proceedings of
the Twelfth European Conference on Machine Learning (ECML-2001), pp.
491-502, Freiburg, Germany.
Remarks
This paper by P. Turney presents an unsupervised learning algorithm, based on
the concept of co-occurrence, that recognizes synonyms by analyzing the
responses to queries sent to an online search engine. The algorithm, PMI-IR,
combines Pointwise Mutual Information (PMI) and Information Retrieval (IR) to
measure how strongly a problem word and each candidate answer are associated.
It is evaluated on TOEFL synonym questions and compared both with the
performance of LSA (Latent Semantic Analysis) and with that of non-native
English-speaking applicants to US colleges. There are 4 scores of increasing
sophistication, yielding an increasing percentage of correct answers.
The major concept of this paper is the implementation and use of an
unsupervised learning algorithm that answers a specific question by
extracting information from the largest source of information one could wish
for, i.e. the World Wide Web. It assigns a score to each candidate answer and
selects the choice that maximizes the score. It is interesting to see how
knowledge of semantics increases this score, because the interpretation of
literary and spoken language is considered really difficult for computers (as
a professor once told me). The author compares the algorithm's performance
with that of non-machine-learning systems that fail to perform well in areas
of expertise. It is also mentioned that most of the hard work of finding
synonyms is done by the search engine and not by the algorithm itself, which
underlines the importance of these search engines.
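As a rough illustration of the scoring step described above, here is a
minimal sketch that assumes the simplest form of a co-occurrence score:
hits(problem AND choice) / hits(choice). The hits function, the helper names
and all hit counts below are my own assumptions for illustration (the numbers
are made up). A real run would send each query to a search engine.

```python
from typing import Callable, Sequence

# Made-up hit counts standing in for a search engine (assumption, not real data).
TOY_HITS = {
    "imposed": 5000,
    "believed": 20000,
    "requested": 15000,
    "correlated": 3000,
    "levied AND imposed": 400,
    "levied AND believed": 100,
    "levied AND requested": 90,
    "levied AND correlated": 15,
}


def toy_hits(query: str) -> int:
    """Hypothetical stand-in for the number of documents matching a query."""
    return TOY_HITS.get(query, 0)


def cooccurrence_score(problem: str, choice: str,
                       hits: Callable[[str], int]) -> float:
    """Ratio of joint hits to the hits of the choice alone."""
    choice_hits = hits(choice)
    if choice_hits == 0:
        return 0.0  # avoid division by zero for unseen words
    return hits(f"{problem} AND {choice}") / choice_hits


def best_choice(problem: str, choices: Sequence[str],
                hits: Callable[[str], int]) -> str:
    """Select the candidate that maximizes the score."""
    return max(choices, key=lambda c: cooccurrence_score(problem, c, hits))


if __name__ == "__main__":
    options = ["imposed", "believed", "requested", "correlated"]
    print(best_choice("levied", options, toy_hits))
```

Dividing by the hits of the choice alone keeps very frequent words from
winning just because they appear everywhere.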
The major flaws of this paper are the restricted set of queries, the
comparison of PMI-IR and LSA, which cannot be made without inaccuracy, and the
outlining of various future applications of the PMI-IR algorithm without
ever trying them (I think that it is not a good idea to mention all
future research interests when they are not fully developed).
In addition, I miss an accurate explanation of the algorithm.
Furthermore, the paper concentrates too much on applications and neglects
the machine learning aspects.
Well, the open research questions are totally clear:
- implementing and testing all the suggestions they give in their
paper
- increasing the query set to more difficult words
- decreasing the running time by multi-threading and reducing network
traffic
- Why did they choose the TOEFL and not the more difficult CSE as the
query source?
- Can this idea be applied to other semantic questions?