Due Date: Thu, Feb 4, 2010 copy of report in class and submit code online (the 5th week)
|
- (Mitchell 10.1) Consider a sequential covering algorithm such as CN2 and a simultaneous covering algorithm such as ID3. Both algorithms are to be used to learn a target concept defined over instances represented by conjunctions of n boolean attributes. If ID3 learns a balanced decision tree of depth d, it will contain 2^d-1 distinct decisions nodes, and therefore will have made 2^d-1 distinct choices while constructing its output hypothesis. How many rules will be formed if this tree is re-expressed as a disjunctive set of rules? How many preconditions will each rule possess? How many distinct choices would a sequential covering algorithm have to make to learn this same set of rules? Which system do you suspect would be more prone to overfitting if both were given the same training data?
- Apply FOIL to relational learning:
- Download FOIL6 from here. (It also contains a picture of our influential alumnus.) Decompress the shell archive and understand how FOIL works by checking out the following files: README, MANUAL, and member.explain. You can also try running FOIL with some example files (e.g., sort.d).
- Download the kinship data file. Run FOIL and report the rules you have learned and accuracy on the test relations. This file is created from the original Kinship dataset. This dataset contains atoms that describe various kinship relations among a group of people. All predicates denote binary relations. For example, son(A,B) means that A is the son of B, brother(C,D) means that C is the brother of D, and so on. We want to learn first-order rules that depict the relations among various kinship. Namely, we want to learn rules such as:
mother(A,B) :- wife(A,C), father(C,B).
You can check out Quinlan's paper (Section 5.1) for more details of the dataset and experiment setting. The command is
./foil6 -v0 < kinship.txt
Here, -v controls the verbosity level; you can use higher numbers (1-6) for more detailed output.
- Suppose that you are learning a rule for daughter(A,B). Your current candidate is daughter(A,B) :- sister(A,C), and you are considering adding a literal to the antecedent. How many distinct literals can be generated? Two literals are considered identical if they differ only in the names of the new variables that they contain.
- Compute the FOIL information gain for adding each of the following literals: son(C,B), wife(C,B), mother(C,B). Which one is the highest? Does it make sense? (You can write a program or compute it by hand.)
-
Suppose the instance space is the square 0 < X,Y < 10,
and you have a training set composed of positive examples at (2, 2), (2, 4),
(2, 6), (2, 8), (4, 5) and negative examples at (4, 3), (4, 7), (6, 4), (6, 6), (8, 5).
(a) Draw the Voronoi diagram of this training set, using Euclidean distance.
(b) List the examples you could remove without changing the frontiers produced by the nearest-neighbor algorithm.
(c) Suppose you measure the error rate of the 3-nearest neighbor
algorithm on this training set using leave-one-out cross-validation
(i.e., you take out each example in turn, and predict its class using
the other examples). If some example has more than 3 examples at the
same nearest distance from it such that different choices of the 3
nearest neighbors give different predictions then count this example
as an error. You thus compute the worst-case error rate. What would
the measured error rate be? Suppose you apply the 3-nearest-neighbor
algorithm with backward elimination. Which features would be
eliminated (X, Y, both, or neither)? Why?
- Collaborative Filtering on Netflix Ratings:
- Read the paper Empirical Analysis
of Predictive Algorithms for Collaborative Filtering. You need to
read up to Section 2.1, and are encouraged to read further if you have
time.
- The dataset we will be using is a subset of the movie
ratings data from
the Netflix Prize. You can download it here. It contains a training set, a test set, a movies file, a dataset description file, and a README file. The training and test sets are both subsets of the Netflix training data. You will use the ratings provided in the training set to predict those in the test set. You will compare your predictions with the
actual ratings provided in the test set. The evaluation metrics you need to measure are the Mean Absolute Error and the Root Mean Squared Error. The dataset description file further describes the dataset, and will help you get started. The README file is from the original set of Netflix files, and has been included to comply with the terms of use for this data.
- Implement the collaborative filtering algorithm described
in Section 2.1 of the paper (Equations 1 and 2; ignore Section 2.1.2)
for making the predictions. You may program in C, C++, Java. If
you'd like to use another language, ask TA first.
- (10% Extra Credit) Add yourself as a new user to the
training set. To do this, you will need to create a new user ID for yourself. Select some movies that you have seen among those in the training set, and add
your ratings for them to the training set. Extend your system to output
predictions for the movies you haven't rated, ranked in order of decreasing ratings. Do you agree with the predictions of your system? Check out some of the top ranked movies that you haven't seen (but only after you have finished work on the project).
Turn in the following:
- Your code for Problem 4. Submit your files here. The due date is 12Noon, Feb 4, 2010. Your code should contain appropriate comments to facilitate understanding.
- A report with your answers to the homework problems. For problem 4, you should describe in high-level how your code works, what is the accuracy of your results; if your accuracy is low, analyze the errors and explain what you think might be the cause and what changes can be made for improvement.
Good luck, and have fun!
|