Review of "Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL"

    This review covers "Mining the Web for Synonyms: PMI-IR versus LSA
    on TOEFL" by Turney. The paper presents PMI-IR, which combines
    Pointwise Mutual Information (PMI) with Information Retrieval (IR) to
    pick synonyms, and compares it to an earlier algorithm, Latent
    Semantic Analysis (LSA).
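
    To make the idea concrete, here is a minimal sketch of one simple
    form such a score could take: the number of documents containing both
    the problem word and a choice, divided by the number containing the
    choice alone. The word list, the hit counts, and the helper names
    (fake_hits, pmi_ir_scores, best_choice) are all made up for
    illustration; a real run would replace fake_hits with search engine
    queries, and the paper's actual scoring may differ in its details.

```python
# Hypothetical hit counts standing in for search engine results.
# Keys are sets of words that must all appear in a document.
FAKE_COUNTS = {
    frozenset(["large"]): 1000,
    frozenset(["big", "large"]): 200,
    frozenset(["small"]): 2000,
    frozenset(["big", "small"]): 100,
    frozenset(["fast"]): 500,
    frozenset(["big", "fast"]): 10,
    frozenset(["green"]): 800,
    frozenset(["big", "green"]): 8,
}


def fake_hits(*words):
    """Return the made-up document count for a query on all given words."""
    return FAKE_COUNTS.get(frozenset(words), 0)


def pmi_ir_scores(problem, choices, hits):
    """Score each choice by how often it co-occurs with the problem word,
    relative to how often the choice occurs at all."""
    scores = {}
    for choice in choices:
        alone = hits(choice)
        together = hits(problem, choice)
        # Guard against a choice the "engine" has never seen.
        scores[choice] = together / alone if alone else 0.0
    return scores


def best_choice(problem, choices, hits):
    """Pick the choice with the highest co-occurrence score."""
    scores = pmi_ir_scores(problem, choices, hits)
    return max(choices, key=lambda c: scores[c])


CHOICES = ["large", "small", "fast", "green"]
SCORES = pmi_ir_scores("big", CHOICES, fake_hits)
ANSWER = best_choice("big", CHOICES, fake_hits)
```

    With these invented counts, "large" gets the highest score because it
    appears alongside "big" in the largest fraction of its documents.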

    The Internet is a huge corpus for a machine to learn from. Most of it,
    however, is not labeled in a way that is useful for supervised
    learning, so it is most valuable to an algorithm that can learn from
    it without supervision. One of the strengths of PMI-IR is that it
    learns from a large corpus without needing a human to guide it or
    give it feedback, and it still does reasonably well. The algorithm
    comes up with useful answers to test questions from the TOEFL, even
    to the point where it does a bit better than some of the humans who
    take the test and, given that each question takes about sixteen
    seconds, is probably faster than some of them. Another strength of
    the paper lies in the fact
    that it leverages the existing technology provided by AltaVista. This
    allows the actual algorithm itself to be fairly concise, elegant, and

    A weakness of this paper comes up in the comparison with LSA. PMI-IR
    does significantly better than LSA on the TOEFL, but it had a much
    larger data set to draw on. Even though computing the SVD on a larger
    corpus might take a while, it would be a fairer comparison to test
    against an LSA trained on a corpus of equal size. Also, although the
    paper mentions the average score of students from non-English-speaking
    countries, it doesn’t mention the overall average or that of students
    from English-speaking countries, which are presumably higher. It would
    be interesting for comparison to see how PMI-IR actually does against an
    average native English speaker, to see if getting about 75% correct is

    One open question from the paper is: can similar unsupervised learning
    algorithms be developed for different purposes? Given that we have
    access to the vast amount of data on the Internet, is it possible to
    answer other questions beyond what the synonym of a word is? Would it
    be possible to take advantage of specialized information on the
    Internet, rather than just treating it as a bunch of words?

    Another interesting idea would be something along the lines of an
    ensemble of classifiers, built from several search engines. The paper
    only used AltaVista, but other search engines might give different
    scores to the choices. The queries could be run on several engines,
    and each engine could then vote for the choice it scored highest.
    This might also relate to the PROVERB paper, because the results
    from the various search engines would have to be combined in some
    meaningful way, just like the results from the various expert modules.
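
    As a rough sketch of the voting idea, assume each search engine is
    wrapped as a function that returns a hit count for a set of words.
    Each engine picks its own top choice, and the choice with the most
    votes wins. The three "engines" and their counts below are invented,
    and majority voting is only one possible way to combine the results;
    nothing here comes from the paper itself.

```python
from collections import Counter


def make_engine(counts):
    """Wrap a dict of made-up hit counts as a hypothetical search engine."""
    def hits(*words):
        return counts.get(frozenset(words), 0)
    return hits


def pick(problem, choices, hits):
    """Return the choice one engine scores highest for the problem word."""
    def score(choice):
        alone = hits(choice)
        return hits(problem, choice) / alone if alone else 0.0
    return max(choices, key=score)


def majority_vote(problem, choices, engines):
    """Let every engine vote for its top choice and return the winner.

    Ties go to whichever tied choice is listed first in `choices`.
    """
    votes = Counter(pick(problem, choices, hits) for hits in engines.values())
    winner = max(choices, key=lambda c: votes[c])
    return winner, votes


# Three invented engines that disagree about the best synonym for "big".
ENGINES = {
    "engine_a": make_engine({
        frozenset(["large"]): 1000, frozenset(["big", "large"]): 200,
        frozenset(["small"]): 2000, frozenset(["big", "small"]): 100,
    }),
    "engine_b": make_engine({
        frozenset(["large"]): 500, frozenset(["big", "large"]): 50,
        frozenset(["small"]): 400, frozenset(["big", "small"]): 20,
    }),
    "engine_c": make_engine({
        frozenset(["large"]): 300, frozenset(["big", "large"]): 3,
        frozenset(["small"]): 100, frozenset(["big", "small"]): 10,
    }),
}

WINNER, VOTES = majority_vote("big", ["large", "small", "fast"], ENGINES)
```

    A weighted combination of the raw scores would be another option,
    closer in spirit to merging the outputs of several expert modules.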

