Paper Review 3

From: Gaurav Bhaya (gbhaya@cs.washington.edu)
Date: Tue Dec 07 2004 - 20:53:44 PST


Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL

Peter D. Turney

--------------------------------------------------------

Short Summary: This paper introduces an unsupervised learning
algorithm that finds synonyms of a given word by issuing queries
to a web search engine. It uses the hit counts returned by a
popular search engine, whose index covers millions of documents,
to estimate the probability that the given words co-occur.
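
To make the co-occurrence idea concrete, here is a minimal sketch of
the simplest form of the score as I understand it: for each
alternative, divide the number of hits for the problem word together
with that alternative by the number of hits for the alternative
alone, and pick the highest ratio. The function names, the query
syntax, and all hit counts below are my own assumptions for
illustration; the counts are made up and do not come from the paper
or from any real search engine.

```python
from typing import Callable, Dict, List


def pmi_ir_choose(problem: str, choices: List[str],
                  hits: Callable[[str], int]) -> str:
    """Return the choice with the highest co-occurrence ratio.

    score(choice) = hits(problem AND choice) / hits(choice)
    `hits` is a stand-in for a search-engine hit count (hypothetical).
    """
    def score(choice: str) -> float:
        alone = hits(choice)
        if alone == 0:
            # A word the index has never seen cannot be scored.
            return 0.0
        return hits(f"{problem} AND {choice}") / alone

    return max(choices, key=score)


# Made-up hit counts, standing in for a real search engine.
FAKE_HITS: Dict[str, int] = {
    "imposed": 1_000_000,
    "believed": 5_000_000,
    "requested": 3_000_000,
    "correlated": 500_000,
    "levied AND imposed": 2_000,
    "levied AND believed": 500,
    "levied AND requested": 300,
    "levied AND correlated": 50,
}


def fake_hits(query: str) -> int:
    """Look up a made-up hit count; unknown queries count as zero."""
    return FAKE_HITS.get(query, 0)


best = pmi_ir_choose(
    "levied", ["imposed", "believed", "requested", "correlated"], fake_hits
)
print(best)
```

With these invented counts the ratio for "imposed" is 0.002 and the
other three are 0.0001, so the sketch picks "imposed". A real run
would replace fake_hits with actual search-engine queries.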

Important Ideas:

- One of the most important ideas presented in the paper is using
the vast index of millions or billions of documents maintained by
a search engine. In this way it exploits not only a large base
of documents but also the computational power of the search engine.

- The second point is that the paper suggests a complex problem can
have a really simple solution. Using only simple operations, PMI-IR
obtains better results than LSA on TOEFL questions.

- I thought the way the paper derived four metrics for related
words was very interesting. It raises the question of whether an
"even better" metric exists, given the performance differences
among the four presented in the paper.

- I especially liked the way the paper represented the closeness of
two words in the last of the four metrics. Using the idea of context
seems very interesting.

Flaws:

- Are TOEFL questions really a good choice for the problem? Most
TOEFL questions are designed so that humans tend to choose one of
the wrong choices based on some guessing pattern. This guessing
pattern may not reflect the co-occurrence of the words in any way.
Furthermore, it may be possible to generate a set of questions that
are difficult for the above algorithm by using two words that occur
together in a phrase.

- The paper compares the computation time to query the web server.
What about the time the search engine takes to crawl all the web
pages and build its index? Is the time comparison between the two
algorithms fair?

- Isn't the sample size too small to draw a conclusion? 74% accuracy
on not even 100 items!
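
To put a rough number on the sample-size concern, here is a small
sketch that computes a 95% Wilson score interval for an observed
accuracy. The figures 59 correct out of 80 questions are my
assumption, chosen only because they match "about 74% on fewer than
100 items"; the function name is likewise my own.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int,
                    z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(
        p * (1 - p) / total + z * z / (4 * total * total)
    ) / denom
    return center - half, center + half


# Assumed counts: 59 of 80 correct, i.e. about 74% accuracy.
low, high = wilson_interval(59, 80)
print(f"{low:.2f} to {high:.2f}")
```

Under these assumed counts the interval runs from roughly 63% to 82%,
which is wide enough that a small difference between two systems on
the same test set should be read with caution.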

Open Questions:

- Can this technique be used when the alternatives are not given? Can
a tool be designed that finds the synonyms of a word given just the
word itself? Unless this is possible, hand-coded dictionaries
cannot be eliminated.

- How can this idea aid in solving the crossword puzzles that we saw
earlier? Would this technique be good enough to replace all the
expert modules in the previous paper, since it exploits something
more than synonyms?

- The paper points out various other possibilities for future work
towards the end. These include extending to LSI and using a smaller
index.

