CSE 446: Machine Learning (Winter 2009)

Assignment 2: Rule Induction and Instance-based Learning

Due Date: Friday, February 6, 2009, in class; submit your code online (the fifth week)

  1. What is the worst-case time complexity of the sequential covering algorithm (separate-and-conquer) for rule induction? Assume that there are n examples and d attributes.
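
     For reference, here is a minimal, runnable sketch of the separate-and-conquer loop over boolean attributes; it is only meant to make the loop structure concrete for the complexity analysis, and the greedy heuristic inside learn_one_rule is an illustrative assumption, not a prescribed choice.

     # Illustrative sketch of sequential covering (separate-and-conquer) for
     # propositional rules over boolean attributes.  Examples are dicts mapping
     # attribute names to True/False; the greedy literal choice is an assumption.

     def learn_one_rule(pos, neg, attributes):
         """Greedily grow a conjunction of (attribute, value) tests that keeps
         covering positives while excluding as many negatives as possible."""
         rule = []                          # conjunction of (attribute, value) literals
         covered_pos, covered_neg = list(pos), list(neg)
         while covered_neg and len(rule) < len(attributes):
             best = None
             for a in attributes:
                 for v in (True, False):
                     p = [x for x in covered_pos if x[a] == v]
                     n = [x for x in covered_neg if x[a] == v]
                     if p and (best is None or len(n) < best[3]):
                         best = (a, v, p, len(n), n)
             if best is None:
                 break
             a, v, p, _, n = best
             rule.append((a, v))
             covered_pos, covered_neg = p, n
         return rule, covered_pos

     def sequential_covering(pos, neg, attributes):
         """Learn rules one at a time, removing the positives each rule covers."""
         rules, remaining = [], list(pos)
         while remaining:
             rule, covered = learn_one_rule(remaining, neg, attributes)
             if not covered:                # no progress; stop rather than loop forever
                 break
             rules.append(rule)
             remaining = [x for x in remaining if x not in covered]
         return rules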

  2. Relational learning with FOIL:

    • Download the kinship dataset, which is created from the original Kinship dataset. This dataset contains atoms that describe various kinship relations among a group of people. All predicates denote binary relations. For example, son(A,B) means that A is the son of B, brother(C,D) means that C is the brother of D, and so on. We want to learn first-order rules that express how these kinship relations are defined in terms of one another. Namely, we want to learn rules such as:

      mother(A,B) :- wife(A,C), father(C,B).

    • Suppose that you are learning a rule for son(A,B). Your current candidate is son(A,B) :- brother(A,C), and you are considering adding a literal to the antecedent. How many distinct literals can be generated? Two literals are considered identical if they differ only in the names of the new variables that they contain.

    • Compute the FOIL information gain for adding each of the following literals: daughter(C,B), wife(C,B), father(C,B). Which one yields the highest gain? Does the result make sense? (You can write a program or compute it by hand; a small helper is sketched below.)
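
      If you go the programming route, the gain itself reduces to a one-line computation once the bindings have been counted. Below is a minimal helper, assuming the standard definition of FOIL gain (as in Quinlan's FOIL and Mitchell's textbook); counting the bindings p0, n0, p1, n1, and t against the kinship atoms is the part left to you.

      from math import log2

      def foil_gain(p0, n0, p1, n1, t):
          """FOIL information gain for adding a literal L to a rule R.

          p0, n0 -- positive / negative bindings covered by R
          p1, n1 -- positive / negative bindings covered by R after adding L
          t      -- positive bindings of R that remain covered after adding L
          """
          return t * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))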

  3. K-nearest neighbor for text classification:

    • The goal of text classification is to identify the topic of a piece of text (news article, weblog post, etc.). Text classification has obvious utility in the age of information overload, and it has become a popular proving ground for machine learning algorithms. In this project, you will have the opportunity to implement k-nearest neighbor and apply it to text classification on the well-known Reuters news collection.

    • Download the dataset, which is created from the original collection and contains a training file, a test file, the list of topics, and a description of the train/test file format.

    • Implement the k-nearest neighbor algorithm for text classification. Your goal is to predict the topic for each news article in the test set. Try each of the following distance or similarity measures with its corresponding representation (one possible shape for the implementation is sketched after this list).

      • (a) Hamming distance: each document is represented as a boolean vector, where each bit represents whether the corresponding word appears in the document.

      • (b) Euclidean distance: each document is represented as a numeric vector, where each number represents how many times the corresponding word appears in the document (it could be zero).

      • (c) Cosine similarity with TF-IDF weights (a popular metric in information retrieval): each document is represented by a numeric vector as in (b). However, now each number is the TF-IDF weight for the corresponding word (as defined below). The similarity between two documents is the dot product of their corresponding vectors, divided by the product of their norms.

    • Let w be a word, d be a document, and N(d,w) be the number of occurrences of w in d (i.e., the number in the vector in (b)). TF stands for term frequency, and TF(d,w) = N(d,w)/W(d), where W(d) is the total number of words in d. IDF stands for inverse document frequency, and IDF(w) = log(D/C(w)), where D is the total number of documents and C(w) is the number of documents that contain the word w; the base of the logarithm is irrelevant, so you can use e or 2. The TF-IDF weight for w in d is TF(d,w)*IDF(w); this is the number you should put in the vector in (c). TF-IDF is a clever heuristic that takes into account the "information content" each word conveys, so that frequent words like "the" are discounted and document-specific ones are amplified. You can find more details about it online or in any standard IR textbook.

    • You should try k = 1, k = 3, and k = 5 with each of the representations above. Notice that with a distance measure, the k nearest neighbors are the ones with the smallest distance from the test point, whereas with a similarity measure, they are the ones with the highest similarity scores.
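
      The sketch below shows one possible shape for such an implementation; it is not a prescribed design. It assumes each document has already been parsed into a (topic, word-count dictionary) pair, omits the parsing of the provided train/test format, and uses function and variable names that are illustrative rather than required.

      from collections import Counter
      from math import log, sqrt

      # Assumed representation (an assumption, not part of the provided files):
      # each document is a pair (topic, counts), where counts is a Counter
      # mapping each word to the number of times it occurs in the document.

      def hamming(a, b, vocab):
          # (a) Boolean vectors: count vocabulary words on which the two
          # documents disagree about presence/absence.
          return sum((w in a) != (w in b) for w in vocab)

      def euclidean(a, b):
          # (b) Count vectors: ordinary Euclidean distance.
          words = set(a) | set(b)
          return sqrt(sum((a.get(w, 0) - b.get(w, 0)) ** 2 for w in words))

      def compute_idf(train_counts):
          # IDF(w) = log(D / C(w)), computed over the training documents.
          D = len(train_counts)
          df = Counter(w for counts in train_counts for w in counts)
          return {w: log(D / c) for w, c in df.items()}

      def tfidf_vector(counts, idf):
          # (c) TF-IDF weights: TF(d,w) = N(d,w) / W(d), weight = TF * IDF.
          total = sum(counts.values())
          return {w: (c / total) * idf.get(w, 0.0) for w, c in counts.items()}

      def cosine(a, b):
          # Dot product divided by the product of the norms (0 if either norm is 0).
          dot = sum(v * b.get(w, 0.0) for w, v in a.items())
          na = sqrt(sum(v * v for v in a.values()))
          nb = sqrt(sum(v * v for v in b.values()))
          return dot / (na * nb) if na and nb else 0.0

      def knn_predict(test_doc, train, k, score, larger_is_closer):
          # Rank training documents (topic, counts) by score and let the k
          # closest ones vote; score is a distance (smaller = closer) or a
          # similarity (larger = closer), selected by larger_is_closer.
          ranked = sorted(((score(test_doc, doc), topic) for topic, doc in train),
                          key=lambda pair: pair[0], reverse=larger_is_closer)
          votes = Counter(topic for _, topic in ranked[:k])
          return votes.most_common(1)[0][0]

      For representation (c), for example, you would precompute idf = compute_idf(...) over the training counts, map every document through tfidf_vector, and call knn_predict with score=cosine and larger_is_closer=True; for (a) and (b), you would pass hamming (wrapped with the training vocabulary, e.g. lambda a, b: hamming(a, b, vocab)) or euclidean with larger_is_closer=False.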

  4. Turn in the following:

    • Your code for Problem 3. Submit your files here. The due date is 11:30AM, February 6, 2009. Your code should contain appropriate comments to facilitate understanding.

    • A report with your answers to the homework problems. For Problem 3, you should describe at a high level how your code works, summarize your results, and note anything you find interesting or surprising in them. Which distance measure and number of nearest neighbors work best, and why do you think this is the case?

Good luck, and have fun!