## CSEP 546: Data Mining (Autumn 2017) Assignment 3: Neural Networks & Ensemble Methods

Due: Sunday, November 19, 2017 at 11:59pm PST.

• Please submit both code and writeup at the submission dropbox.
• Provide all source code (sufficiently commented) so that we can run it ourselves.
• For the programming question you may use C/C++/C#, Java, Python or MATLAB.
• You may use libraries for parsing the data, but you should include any non-standard libraries used in your submission. You may use standard libraries such as "math.h", "scipy" and "numpy" without including them in your submission.
• Do not submit the datasets provided to you.
• The writeup should be a typed PDF limited to four pages with reasonable fonts and margins. Submit your writeup as a separate file (i.e., not inside an archive file containing your code).
• Check the discussion board for answers to questions you may have about the assignment or post new questions.
• Follow the discussion board and announcements page for updates and clarifications.

### Problem 1: Neural Networks and Backpropagation

For this question you will implement backpropagation and train a multi-layer perceptron (MLP) to classify digit images using the classical MNIST dataset.

• Download the dataset of the projections of MNIST to its top 50 PCA directions.
Note: While we do not encourage using a validation set for this assignment, if you do insist on using one please use this data set with a validation set, where the validation set was sampled uniformly at random from the training set in a 5:1 split.
The original MNIST dataset is composed of 60k training images and 10k test images, where the digits have been centered inside 28x28 pixel images. In order to speed up the computation, you will use a dataset where each sample's dimension was reduced from 784 to 50 by projecting it the top 50 PCA directions of the training data (we will learn about PCA later in class). The format of the projected dataset (idx2-double) is as follows (the label files are unmodified):

[offset][type][value][description]
000032 bit integer0x00000802(2050) magic number (MSB first)
000432 bit integer60000 (or 10000)number of samples
000832 bit integer50number of dimensions
001264 bit double??component score (1st component of 1st image projection)
002064 bit double??component score (2nd component of 1st image projection)
........
xxxx64 bit double??component score (last component of last image projection)

• Implement the the backpropagation algorithm, as described in Chapter 4 of Mitchell or Chapter 6 of Duda et al.

• Use your program to train a neural network with 1 hidden layer and a sigmoid transfer function using the squared loss as the loss function at the top layer.
• The input layer should have 50 dimensions (as the data is 50 dimensions after PCA).
• The output should simply consist of the predictions of the network as a one-hot vector. Therefore, it should have 10 dimensions. We will make the output layer - the ten predictions made by the network - be a sigmoid applied to a linear combination of the hidden layer outputs.
• The hidden layer should consist of $n_h$ nodes with a sigmoid transfer function. Train networks with $n_h$ = 10, 50, 100, 500 and 1000.

Try finding a learning rate (or learning rate scheme), mini-batch size and initialization scheme that works best. You may also add a momentum term to your optimization to accelerate it. Additionally, you may add a regularization term to avoid overfitting (or use other methods, such as early stopping). Note that sophisticated addition of a momentum or regularization term should involve only minimal changes to your code, but may improve the computation significantly (in either quality of the results or runtime).
Note that should you compare your test errors to those of the neural network results published on the webpage of the original dataset (for comparable network architectures), you should be able to significantly improve upon them.

• Use your program to train a neural network with 1 hidden layer and Rectified Linear Units (ReLU) using the squared loss as the loss function at the top layer. That is, build a network whose architecture is the same as before, but replace the sigmoid with the rectifier transfer fuction which is defined as $R(x) = \max(0,x)$. Train only one such network whose number of nodes at the hidden layer is the same as the best performing network from the previous part.

#### Problem writeup:

1.1 Write a paragraph describing your design choices. In particular, specify all your parameter choices: your learning rate (or learning rate scheme), mini-batch size, initialization scheme.

1.2 For each of the 6 networks trained (5 different values of $n_h$ and the ReLU network), plot the squared loss after every half epoch (starting with your initial squared error). Please label your axes in a readable way. Plot the loss of both the training and test losses on the same plot.

1.3 Do the same as 1.2 for the 0/1 loss (i.e. 1 - accuracy), but this time start when the loss is below 7% (or when $\frac{2}{3}$ of epochs have elapsed, whichever comes first) to make to plot more readable.

1.4 What is your final squared loss and 0/1 loss for both the training and test sets for each network?

1.5 How does using ReLUs compare to using the sigmoid function? Why?

### Problem 2: Ensemble Methods

Consider an ensemble learning algorithm that uses simple majority voting among $K$ learned hypotheses. Suppose that each hypothesis has error $\epsilon$ and that the errors made by each hypothesis are independent of the others'.

2.1 Calculate a formula for the error of the ensemble algorithm in terms of $K$ and $\epsilon$, and evaluate it for the case where $K$ = 5, 10 and 20 and $\epsilon$ = 0.1, 0.2 and 0.4.

2.2 If the independence assumption is removed, is it possible for the ensemble error to be worse than $\epsilon$? Justify you answer.