CSEP 546: Data Mining (Autumn 2017)
Assignment 2: Collaborative Filtering and Bayesian Networks
Due: Monday, November 6th, 2017 at 12:01am PST.
- Please submit both code and writeup at the submission dropbox.
- Provide all source code (sufficiently commented) so that we can run it ourselves.
- For the programming question you may use C/C++/C#, Java, Python or MATLAB. However, Python and MATLAB are highly recommended, as they are higher level languages.
- You may use libraries for parsing the data, but you should include any non-standard libraries used in your submission. You may use standard libraries such as "math.h", "scipy" and "numpy" without including them in your submission.
- The writeup should be a typed PDF limited to four pages with reasonable fonts and margins. Submit your writeup as a separate file (i.e., not inside an archive file containing your code).
- Check the discussion board for answers to questions you may have about the assignment or post new questions.
- Follow the discussion board and announcements page for updates and clarifications.
Problem 1: Collaborative Filtering on Netflix Ratings
1.0 Read the paper Empirical Analysis of Predictive Algorithms for Collaborative Filtering. You need to read up to Section 2.1, and are encouraged to read further if you have time.
1.1 The dataset we will be using is a subset of the movie ratings data from the Netflix Prize. You can download it here. It contains a training set, a test set, a movies file, a dataset description file, and a README file. The training and test sets are both subsets of the Netflix training data. You will use the ratings provided in the training set to predict those in the test set. You will compare your predictions with the actual ratings provided in the test set. The evaluation metrics you need to measure are the Mean Absolute Error and the Root Mean Squared Error. The dataset description file further describes the dataset, and will help you get started. The README file is from the original set of Netflix files, and has been included to comply with the terms of use for this data.
1.2 Implement the collaborative filtering algorithm described in Section 2.1 of the paper (Equations 1 and 2; ignore Section 2.1.2) for making the predictions.
Extra-credit: Add yourself as a new user to the training set. To do this, you will need to create a new user ID for yourself. Select some movies that you have seen among those in the training set, and add your ratings for them to the training set. Extend your system to output predictions for the movies you haven't rated, ranked in order of decreasing ratings. Do you agree with the predictions of your system? Check out some of the top ranked movies that you haven't seen (but only after you have finished work on the project).
Problem writeup:
- A high-level description on how your code works.
- The accuracy you obtain
- If all your accuracies are low, tell us what you have tried to improve the accuracies and what you suspect is failing.
- Regardless of whether your accuracies are good or bad, what are the shortcomings of the collaborative filtering algorithm and how might you augment it?
Problem 2: Bayesian Networks
2.1
You are managing an intern hoping to launch a product, and you know the following:
- If the intern writes good code and has enough tests, the launch is more likely to be successful.
- If coffee is available, the intern writes better code.
- If coffee is available, the intern is more likely to write tests.
- Having a tight launch deadline results in the intern writing fewer tests
Please answer the following questions.
- Draw the Bayesian network consistent with the statements above using variables LaunchSuccess, CodeQuality, TestCoverage, Deadline, and HasCoffee.
- According to this network, if the intern writes tests, is the code more likely to be good? Why or why not?
- According to this network, are the code quality and deadline independent? Why or why not?
- According to this network, are the code quality and deadline independent given a successful launch? Why or why not?
- According to this network, are the code quality and deadline independent given that the intern did not get any coffee? Why or why not?
2.2
Suppose you have 4 binary variables: $a$, $b$, $c$, and $d$. Consider the Bayesian network structured as follows $a \rightarrow b \rightarrow c \rightarrow d$. The network has the following parameters:
- $P(a) = 0.6$
- $P(b | a) = 0.9$
- $P(b | \neg a) = 0.2$
- $P(c | b) = 0.3$
- $P(c | \neg b) = 0.7$
- $P(d | c) = 0.4$
- $P(d | \neg c) = 0.7$
Please answer the following questions:
- Compute the probability of $P(d)$. (hint: you need to marginalize out the probabilities of $a$, $b$, and, $c$)
- How many summations did you need to compute? Can you do better with a different ordering of summations?
- Suppose we have a Bayesian network with a similar structure but with $n$ binary variables in a chain instead, i.e. $x_1 \rightarrow x_2 \rightarrow x_3 \rightarrow \ldots \rightarrow x_n$. What is the fewest number of sums needed to compute $P(x_n)$?
- What is the worst-case number of sums needed to compute $P(x_n)$?
2.3
Consider again the above chain-structured Bayesian network, $a \rightarrow b \rightarrow c \rightarrow d$. Suppose you have a training set composed of the following examples in the form $(a,b,c,d)$, with "?" indicating a missing value: $(0,1,1,1)$, $(1,1,0,0)$, $(1,0,0,0)$, $(1,0,1,1)$, and $(1,?,0,1)$. Show the first iteration of EM algorithm (initial parameters, E-step, M-step), assuming the parameters are initialized ignoring missing values. Note that this problem does not use the parameters described in problem 2.2.