Due Date: Monday, May 5, 2008 (the sixth week); turn in the report in class and submit your code online
- (Mitchell 10.1) Consider a sequential covering algorithm such as CN2 and a simultaneous covering algorithm such as ID3. Both algorithms are to be used to learn a target concept defined over instances represented by conjunctions of n boolean attributes. If ID3 learns a balanced decision tree of depth d, it will contain 2^d-1 distinct decision nodes, and therefore will have made 2^d-1 distinct choices while constructing its output hypothesis. How many rules will be formed if this tree is re-expressed as a disjunctive set of rules? How many preconditions will each rule possess? How many distinct choices would a sequential covering algorithm have to make to learn this same set of rules? Which system do you suspect would be more prone to overfitting if both were given the same training set?
- Apply FOIL to relational learning:
- Suppose the instance space is the square -10 < X,Y < 10,
and you have a training set composed of positive examples at (0, 0), (0, 5),
(0, -5), and negative examples at (-5, -5), (-5, 0), (-5, 5), (5, 0), (5, -5), (5, 5).
(a) Draw the Voronoi diagram of this training set, using Euclidean distance.
(b) Suppose you measure the error rate of the 3-nearest-neighbor
algorithm on this training set using leave-one-out cross-validation (i.e.,
you take out each example in turn and predict its class using the other
examples). What would the measured error rate be? If some example
has more than 3 examples at the same nearest distance from it, so that
different choices of the 3 nearest neighbors give different predictions,
then count this example as an error. You thus compute the worst-case
error rate.
(c) Suppose you apply the 3-nearest-neighbor algorithm with
backward elimination. Which features would be eliminated (X, Y,
both, or neither)? Why?
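For part (b), the worst-case leave-one-out procedure can be sketched in a few lines of Python (for illustration only). The point coordinates come straight from the problem statement, and the tie-breaking rule follows the worst-case convention stated above; function and variable names are ours.

```python
from itertools import combinations

# Training points from the problem statement.
pos = [(0, 0), (0, 5), (0, -5)]
neg = [(-5, -5), (-5, 0), (-5, 5), (5, 0), (5, -5), (5, 5)]
data = [(p, +1) for p in pos] + [(p, -1) for p in neg]

def dist2(a, b):
    # Squared Euclidean distance; squaring preserves the neighbor ordering.
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def worst_case_errors(data, k=3):
    """Leave-one-out k-NN error count under the worst-case tie rule."""
    errors = 0
    for i, (pt, lab) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        kth = sorted(dist2(pt, p) for p, _ in rest)[k - 1]
        must = [l for p, l in rest if dist2(pt, p) < kth]   # strictly closer than the k-th
        tied = [l for p, l in rest if dist2(pt, p) == kth]  # tied at the k-th distance
        # Worst case: count an error if ANY way of completing the k neighbors
        # from the tied points yields a majority vote that disagrees with lab.
        if any((sum(must) + sum(c)) * lab < 0
               for c in combinations(tied, k - len(must))):
            errors += 1
    return errors
```

Calling `worst_case_errors(data)` runs the procedure; working the neighbor sets out by hand first is the point of the exercise.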
- Collaborative Filtering on Netflix Ratings:
- Read the paper Empirical Analysis
of Predictive Algorithms for Collaborative Filtering. You need to
read up to Section 2.1, and are encouraged to read further if you have
time.
- The dataset we will be using is a subset of the movie
ratings data from
the Netflix Prize. You can download it here. It contains a training set, a test set, a movies file, a dataset description file, and a README file. The training and test sets are both subsets of the Netflix training data. You will use the ratings provided in the training set to predict those in the test set. You will compare your predictions with the
actual ratings provided in the test set. The evaluation metrics you need to measure are the Mean Absolute Error and the Root Mean Squared Error. The dataset description file further describes the dataset, and will help you get started. The README file is from the original set of Netflix files, and has been included to comply with the terms of use for this data.
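The two metrics can be computed directly from parallel lists of predicted and actual ratings; a minimal sketch in Python (function names are ours):

```python
import math

# Mean Absolute Error and Root Mean Squared Error over parallel lists
# of predicted and actual ratings.
def mean_absolute_error(predicted, actual):
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def root_mean_squared_error(predicted, actual):
    return math.sqrt(sum((p - a) ** 2
                         for p, a in zip(predicted, actual)) / len(actual))
```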
- Implement the collaborative filtering algorithm described
in Section 2.1 of the paper (Equations 1 and 2; ignore Section 2.1.2)
for making the predictions. You may program in C, C++, Java or Matlab. If
you'd like to use another language, ask Hoifung first.
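For orientation only (your submission must be in C, C++, Java, or Matlab), here is a rough Python sketch of one reading of Equations 1 and 2: Pearson-correlation weights computed over commonly rated items, combined in a normalized, mean-centered weighted sum. The toy ratings dictionary and all names are ours, not the assignment's.

```python
import math

ratings = {  # user -> {movie: rating}; hypothetical toy data
    "u1": {"m1": 5, "m2": 3, "m3": 4},
    "u2": {"m1": 4, "m2": 2, "m3": 5},
    "u3": {"m1": 1, "m2": 5},
}

def mean_rating(user):
    votes = ratings[user]
    return sum(votes.values()) / len(votes)

def pearson(a, i):
    """Equation 2 (one reading): correlation of users a and i over the
    movies both have rated."""
    common = set(ratings[a]) & set(ratings[i])
    if not common:
        return 0.0
    ma, mi = mean_rating(a), mean_rating(i)
    num = sum((ratings[a][j] - ma) * (ratings[i][j] - mi) for j in common)
    den = math.sqrt(sum((ratings[a][j] - ma) ** 2 for j in common) *
                    sum((ratings[i][j] - mi) ** 2 for j in common))
    return num / den if den else 0.0

def predict(a, movie):
    """Equation 1 (one reading): user a's mean rating plus a normalized,
    weighted sum of the other users' mean-centered ratings for the movie."""
    ma = mean_rating(a)
    num, norm = 0.0, 0.0
    for i in ratings:
        if i != a and movie in ratings[i]:
            w = pearson(a, i)
            num += w * (ratings[i][movie] - mean_rating(i))
            norm += abs(w)
    return ma + num / norm if norm else ma
```

This sketch recomputes means and correlations on every call; for the full Netflix subset you will want to precompute them, which is where most of the running time goes.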
- (10% Extra Credit) Add yourself as a new user to the
training set. To do this, you will need to create a new user ID for yourself. Select some movies that you have seen among those in the training set, and add
your ratings for them to the training set. Extend your system to output
predictions for the movies you haven't rated, ranked in decreasing order of predicted rating. Do you agree with your system's predictions? Check out some of the top-ranked movies that you haven't seen (but only after you have finished work on the project).
- Turn in the following:
- Your code for Problem 4. Submit your files here. The due date is 12PM, May 5, 2008. Your code should contain appropriate comments to facilitate understanding.
- A report with your answers to the homework problems. For Problem 4, describe at a high level how your code works and report the accuracy of your results; if your accuracy is low, analyze the errors, explain what you think the cause might be, and suggest changes that could improve it.
Good luck, and have fun!