CSEP 546: Data Mining (Autumn 2017)
Assignment 2: Collaborative Filtering and Bayesian Networks

Due: Monday, November 6th, 2017 at 12:01am PST.

Problem 1: Collaborative Filtering on Netflix Ratings

1.0 Read the paper Empirical Analysis of Predictive Algorithms for Collaborative Filtering. You need to read at least through Section 2.1, and are encouraged to read further if you have time.

1.1 The dataset we will be using is a subset of the movie ratings data from the Netflix Prize. You can download it here. It contains a training set, a test set, a movies file, a dataset description file, and a README file. The training and test sets are both subsets of the Netflix training data. You will use the ratings provided in the training set to predict those in the test set. You will compare your predictions with the actual ratings provided in the test set. The evaluation metrics you need to measure are the Mean Absolute Error and the Root Mean Squared Error. The dataset description file further describes the dataset, and will help you get started. The README file is from the original set of Netflix files, and has been included to comply with the terms of use for this data.

1.2 Implement the collaborative filtering algorithm described in Section 2.1 of the paper (Equations 1 and 2; ignore Section 2.1.2) for making the predictions.
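A minimal sketch of this memory-based approach (the mean-offset prediction of Equation 1 weighted by the Pearson correlations of Equation 2) might look like the following. The data layout (dicts mapping user to {movie: rating}) and the use of each user's mean over co-rated items inside the correlation are assumptions of this sketch, not requirements of the assignment; the paper defines the means over all of a user's votes.

```python
import numpy as np

def pearson_weight(va, vi):
    """Correlation weight w(a, i) (Eq. 2), computed over movies rated by
    both users. va, vi: dicts mapping movie id -> rating.
    NOTE (assumption): means are taken over the co-rated movies only,
    a common simplification of the paper's definition."""
    common = sorted(set(va) & set(vi))
    if len(common) < 2:
        return 0.0
    a = np.array([va[j] for j in common], dtype=float)
    i = np.array([vi[j] for j in common], dtype=float)
    a -= a.mean()
    i -= i.mean()
    denom = np.sqrt((a ** 2).sum() * (i ** 2).sum())
    return float(a @ i / denom) if denom > 0 else 0.0

def predict(active, movie, ratings):
    """Predicted rating p_{a,j} for the active user on a movie (Eq. 1).
    ratings: dict mapping user id -> {movie id: rating}."""
    va = ratings[active]
    va_bar = np.mean(list(va.values()))
    num, kappa_inv = 0.0, 0.0
    for user, vi in ratings.items():
        if user == active or movie not in vi:
            continue
        w = pearson_weight(va, vi)
        # offset from that user's own mean vote, weighted by similarity
        num += w * (vi[movie] - np.mean(list(vi.values())))
        kappa_inv += abs(w)  # kappa normalizes the absolute weights to 1
    return va_bar if kappa_inv == 0 else va_bar + num / kappa_inv
```

For the full Netflix subset you would precompute each user's mean vote once rather than recomputing it inside the loop; the nested-loop form above is only meant to mirror the equations.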

Extra-credit: Add yourself as a new user to the training set. To do this, you will need to create a new user ID for yourself. Select some movies that you have seen among those in the training set, and add your ratings for them to the training set. Extend your system to output predictions for the movies you haven't rated, ranked in decreasing order of predicted rating. Do you agree with the predictions of your system? Check out some of the top-ranked movies that you haven't seen (but only after you have finished work on the project).

Problem writeup:

Problem 2: Bayesian Networks

2.1 You are managing an intern hoping to launch a product, and you know the following:

Please answer the following questions.
  1. Draw the Bayesian network consistent with the statements above using variables LaunchSuccess, CodeQuality, TestCoverage, Deadline, and HasCoffee.
  2. According to this network, if the intern writes tests, is the code more likely to be good? Why or why not?
  3. According to this network, are the code quality and deadline independent? Why or why not?
  4. According to this network, are the code quality and deadline independent given a successful launch? Why or why not?
  5. According to this network, are the code quality and deadline independent given that the intern did not get any coffee? Why or why not?

2.2 Suppose you have 4 binary variables: $a$, $b$, $c$, and $d$. Consider the Bayesian network structured as the chain $a \rightarrow b \rightarrow c \rightarrow d$. The network has the following parameters:

Please answer the following questions:
  1. Compute the marginal probability $P(d)$. (Hint: you need to marginalize out $a$, $b$, and $c$.)
  2. How many summations did you need to compute? Can you do better with a different ordering of summations?
  3. Suppose we have a Bayesian network with a similar structure but with $n$ binary variables in a chain instead, i.e. $x_1 \rightarrow x_2 \rightarrow x_3 \rightarrow \ldots \rightarrow x_n$. What is the fewest number of sums needed to compute $P(x_n)$?
  4. What is the worst-case number of sums needed to compute $P(x_n)$?
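The efficient ordering pushes each sum inward: compute $P(x_2)$ from $P(x_1)$, then $P(x_3)$ from $P(x_2)$, and so on, so each edge costs one two-term sum instead of expanding the full joint. A sketch of this forward pass for a chain of binary variables (the parameter encoding below is an assumption of the sketch):

```python
def chain_marginal(p_x1, cond):
    """P(x_n = 1) for a binary chain x1 -> x2 -> ... -> xn.
    p_x1: P(x1 = 1).
    cond: list of (P(x_{k+1}=1 | x_k=0), P(x_{k+1}=1 | x_k=1)), one per edge.
    Each loop step marginalizes out one variable with a single 2-term sum,
    so the whole pass needs n-1 sums instead of summing 2^(n-1) joint terms."""
    p = p_x1
    for p_given0, p_given1 in cond:
        p = p_given0 * (1 - p) + p_given1 * p
    return p
```

This is exactly the variable-elimination ordering that sums out $a$, then $b$, then $c$ when computing $P(d)$ in the four-variable chain.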

2.3 Consider again the above chain-structured Bayesian network, $a \rightarrow b \rightarrow c \rightarrow d$. Suppose you have a training set composed of the following examples in the form $(a,b,c,d)$, with "?" indicating a missing value: $(0,1,1,1)$, $(1,1,0,0)$, $(1,0,0,0)$, $(1,0,1,1)$, and $(1,?,0,1)$. Show the first iteration of the EM algorithm (initial parameters, E-step, M-step), assuming the parameters are initialized ignoring missing values. Note that this problem does not use the parameters described in problem 2.2.
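The mechanics of one such iteration can be sketched in code (a sketch, not the required writeup: you still need to show the numbers by hand). The E-step posterior for the missing $b$ uses only $P(b \mid a)$ and $P(c \mid b)$, since $P(d \mid c)$ does not depend on $b$ and cancels; $P(a)$ and $P(d \mid c)$ are re-estimated the same way but are unaffected by the missing value.

```python
# Chain a -> b -> c -> d; None marks the missing value in the last example.
data = [(0, 1, 1, 1), (1, 1, 0, 0), (1, 0, 0, 0), (1, 0, 1, 1), (1, None, 0, 1)]

def mle(pairs):
    """P(child=1 | parent=v) from (parent, child, weight) triples."""
    out = {}
    for v in (0, 1):
        num = sum(w for p, c, w in pairs if p == v and c == 1)
        den = sum(w for p, c, w in pairs if p == v)
        out[v] = num / den if den else 0.5
    return out

# Initial parameters: maximum likelihood ignoring the incomplete row.
obs = [r for r in data if None not in r]
p_b = mle([(a, b, 1.0) for a, b, c, d in obs])   # P(b=1 | a)
p_c = mle([(b, c, 1.0) for a, b, c, d in obs])   # P(c=1 | b)

# E-step: posterior over the missing b in (1, ?, 0, 1).
# P(b | a=1, c=0, d=1) is proportional to P(b | a=1) * P(c=0 | b).
w1 = p_b[1] * (1 - p_c[1])        # unnormalized weight for b = 1
w0 = (1 - p_b[1]) * (1 - p_c[0])  # unnormalized weight for b = 0
r1 = w1 / (w0 + w1)               # expected value of the missing b

# M-step: re-estimate P(b=1 | a) using the fractional counts.
pairs_ab = [(a, b, 1.0) for a, b, c, d in obs] + [(1, 1, r1), (1, 0, 1 - r1)]
p_b_new = mle(pairs_ab)
```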