Paper Review 3

From: Indriyati Atmosukarto (indria@cs.washington.edu)
Date: Wed Dec 08 2004 - 08:44:48 PST


    Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL
    Peter D. Turney

    This paper describes an unsupervised learning algorithm, PMI-IR,
    which combines Pointwise Mutual Information (PMI) with
    Information Retrieval (IR) to measure the similarity of pairs
    of words. The paper then compares the algorithm's performance
    against LSA and shows that PMI-IR performs better than LSA on
    the TOEFL dataset.

    It was interesting to see how the author combined PMI with IR.
    The algorithm is based on the co-occurrence of the problem word
    with the choice words. The PMI score for each choice word is
    calculated from the probabilities of the problem word and the
    choice word, which are estimated from the results returned by
    IR queries. The author also came up with four ways of calculating
    the score for each choice word, ranging from a simple
    interpretation of word co-occurrence to more refined ones that
    use the search engine's NEAR operator and take the context of
    the words into consideration as well.
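
    To make the scoring idea concrete, here is a minimal sketch of how a
    PMI score could be estimated from search-engine hit counts and used to
    pick a choice word. It assumes PMI is estimated as
    log2(p(problem, choice) / (p(problem) * p(choice))) with probabilities
    taken as hit counts over a collection size. The function names, the
    collection size and all counts are hypothetical placeholders, not
    figures or code from the paper.

```python
import math


def pmi(joint_hits, problem_hits, choice_hits, total_docs):
    """Estimate pointwise mutual information from document hit counts."""
    if joint_hits == 0:
        # The two words never co-occur: treat the score as minus infinity.
        return float("-inf")
    p_joint = joint_hits / total_docs
    p_problem = problem_hits / total_docs
    p_choice = choice_hits / total_docs
    return math.log2(p_joint / (p_problem * p_choice))


def best_choice(problem_hits, choices, total_docs):
    """Return the choice word with the highest PMI score.

    `choices` maps each choice word to (joint_hits, choice_hits).
    """
    return max(
        choices,
        key=lambda w: pmi(choices[w][0], problem_hits, choices[w][1], total_docs),
    )


# Hypothetical counts, invented purely for illustration.
TOTAL_DOCS = 1_000_000
PROBLEM_HITS = 5_000
CHOICES = {
    "alpha": (400, 20_000),
    "beta": (300, 3_000),
    "gamma": (50, 10_000),
    "delta": (10, 1_000),
}

if __name__ == "__main__":
    print(best_choice(PROBLEM_HITS, CHOICES, TOTAL_DOCS))
```

    Since p(problem) is the same for every choice and the logarithm is
    monotonic, ranking by this score is equivalent to ranking by
    joint_hits / choice_hits, so the collection size does not affect
    which choice wins.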

    Though the idea seems quite interesting, the lack of experimental
    results fails to convince me of the algorithm's true performance.
    Only 80 TOEFL questions and 50 ESL test questions were used in the
    experiments, which is a very small dataset considering that TOEFL
    has been around for years. In addition, the paper only presented
    the result of LSA on the TOEFL dataset but not on the ESL dataset.
    More evaluation of how the different scoring techniques perform
    relative to each other would have been appreciated as well.

    The first question that occurred to me when I read the introduction
    was why the author used the AltaVista search engine to retrieve the
    information for the document collection. I would be interested to see
    whether there would be any difference in the algorithm's performance if
    the author used Google or MSN Search instead, especially since Google has
    a number of operators that can be used to constrain search queries. It
    would also be interesting to see how the algorithm fares on other types of
    tests such as the GRE verbal test, which is known to be more difficult than
    TOEFL. The verbal test contains not only synonym questions but also
    antonyms, analogies, sentence completion and, most difficult of all, a
    reading comprehension section.


