Due Date: Monday, May 5, 2008 (the sixth week); turn in the report in class and submit your code online
- (Mitchell 10.1) Consider a sequential covering algorithm such as CN2 and a simultaneous covering algorithm such as ID3. Both algorithms are to be used to learn a target concept defined over instances represented by conjunctions of n boolean attributes. If ID3 learns a balanced decision tree of depth d, it will contain 2^d-1 distinct decision nodes, and therefore will have made 2^d-1 distinct choices while constructing its output hypothesis. How many rules will be formed if this tree is re-expressed as a disjunctive set of rules? How many preconditions will each rule possess? How many distinct choices would a sequential covering algorithm have to make to learn this same set of rules? Which system do you suspect would be more prone to overfitting if both were given the same training set?
- Apply FOIL to relational learning:
- Suppose the instance space is the square -10 < X,Y < 10,
and you have a training set composed of positive examples at (0, 0), (0, 5),
(0, -5), and negative examples at (-5, -5), (-5, 0), (-5, 5), (5, 0), (5, -5), (5, 5).
(a) Draw the Voronoi diagram of this training set, using Euclidean distance.
(b) Suppose you measure the error rate of the 3-nearest-neighbor
algorithm on this training set using leave-one-out cross-validation (i.e.,
you take out each example in turn and predict its class using the other
examples). What would the measured error rate be? If some example
has more than 3 examples at the same nearest distance from it, so that
different choices of the 3 nearest neighbors give different predictions,
then count this example as an error. You thus compute the worst-case
error rate.
(c) Suppose you apply the 3-nearest-neighbor algorithm with
backward elimination. Which features would be eliminated (X, Y,
both, or neither)? Why?
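For part (b), the worst-case leave-one-out procedure can be sketched in a few lines of Python (for illustration only). The point coordinates come straight from the problem statement, and the tie-breaking rule follows the worst-case convention stated above; function and variable names are ours.

```python
from itertools import combinations

# Training points from the problem statement.
pos = [(0, 0), (0, 5), (0, -5)]
neg = [(-5, -5), (-5, 0), (-5, 5), (5, 0), (5, -5), (5, 5)]
data = [(p, +1) for p in pos] + [(p, -1) for p in neg]

def dist2(a, b):
    # Squared Euclidean distance; squaring preserves the neighbor ordering.
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def worst_case_errors(data, k=3):
    """Leave-one-out k-NN error count under the worst-case tie rule."""
    errors = 0
    for i, (pt, lab) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        kth = sorted(dist2(pt, p) for p, _ in rest)[k - 1]
        must = [l for p, l in rest if dist2(pt, p) < kth]   # strictly closer than the k-th
        tied = [l for p, l in rest if dist2(pt, p) == kth]  # tied at the k-th distance
        # Worst case: count an error if ANY way of completing the k neighbors
        # from the tied points yields a majority vote that disagrees with lab.
        if any((sum(must) + sum(c)) * lab < 0
               for c in combinations(tied, k - len(must))):
            errors += 1
    return errors
```

Calling `worst_case_errors(data)` runs the procedure; working the neighbor sets out by hand first is the point of the exercise.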
- Collaborative Filtering on Netflix Ratings:
- Read the paper Empirical Analysis
of Predictive Algorithms for Collaborative Filtering. You need to
read up to Section 2.1, and are encouraged to read further if you have
time.
- The dataset we will be using is a subset of the movie
ratings data from
the Netflix Prize. You can download it here. It contains a training set, a test set, a movies file, a dataset description file, and a README file. The training and test sets are both subsets of the Netflix training data. You will use the ratings provided in the training set to predict those in the test set. You will compare your predictions with the
actual ratings provided in the test set. The evaluation metrics you need to measure are the Mean Absolute Error and the Root Mean Squared Error. The dataset description file further describes the dataset, and will help you get started. The README file is from the original set of Netflix files, and has been included to comply with the terms of use for this data.
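The two metrics can be computed directly from parallel lists of predicted and actual ratings; a minimal sketch in Python (function names are ours):

```python
import math

# Mean Absolute Error and Root Mean Squared Error over parallel lists
# of predicted and actual ratings.
def mean_absolute_error(predicted, actual):
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def root_mean_squared_error(predicted, actual):
    return math.sqrt(sum((p - a) ** 2
                         for p, a in zip(predicted, actual)) / len(actual))
```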
- Implement the collaborative filtering algorithm described
in Section 2.1 of the paper (Equations 1 and 2; ignore Section 2.1.2)
for making the predictions. You may program in C, C++, Java or Matlab. If
you'd like to use another language, ask Hoifung first.
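For orientation only (your submission must be in C, C++, Java, or Matlab), here is a rough Python sketch of one reading of Equations 1 and 2: Pearson-correlation weights computed over commonly rated items, combined in a normalized, mean-centered weighted sum. The toy ratings dictionary and all names are ours, not the assignment's.

```python
import math

ratings = {  # user -> {movie: rating}; hypothetical toy data
    "u1": {"m1": 5, "m2": 3, "m3": 4},
    "u2": {"m1": 4, "m2": 2, "m3": 5},
    "u3": {"m1": 1, "m2": 5},
}

def mean_rating(user):
    votes = ratings[user]
    return sum(votes.values()) / len(votes)

def pearson(a, i):
    """Equation 2 (one reading): correlation of users a and i over the
    movies both have rated."""
    common = set(ratings[a]) & set(ratings[i])
    if not common:
        return 0.0
    ma, mi = mean_rating(a), mean_rating(i)
    num = sum((ratings[a][j] - ma) * (ratings[i][j] - mi) for j in common)
    den = math.sqrt(sum((ratings[a][j] - ma) ** 2 for j in common) *
                    sum((ratings[i][j] - mi) ** 2 for j in common))
    return num / den if den else 0.0

def predict(a, movie):
    """Equation 1 (one reading): user a's mean rating plus a normalized,
    weighted sum of the other users' mean-centered ratings for the movie."""
    ma = mean_rating(a)
    num, norm = 0.0, 0.0
    for i in ratings:
        if i != a and movie in ratings[i]:
            w = pearson(a, i)
            num += w * (ratings[i][movie] - mean_rating(i))
            norm += abs(w)
    return ma + num / norm if norm else ma
```

This sketch recomputes means and correlations on every call; for the full Netflix subset you will want to precompute them, which is where most of the running time goes.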
- (10% Extra Credit) Add yourself as a new user to the
training set. To do this, you will need to create a new user ID for yourself. Select some movies that you have seen among those in the training set, and add
your ratings for them to the training set. Extend your system to output
predictions for the movies you haven't rated, ranked in decreasing order of predicted rating. Do you agree with your system's predictions? Check out some of the top-ranked movies that you haven't seen (but only after you have finished work on the project).
- Turn in the following:
- Your code for Problem 4. Submit your files here. The due date is 12PM, May 5, 2008. Your code should contain appropriate comments to facilitate understanding.
- A report with your answers to the homework problems. For Problem 4, describe at a high level how your code works and report the accuracy of your results; if your accuracy is low, analyze the errors, explain what you think the cause might be, and suggest changes that could improve it.
Good luck, and have fun!