Due: Sunday, November 19, 2017 at 11:59pm PST.
For this question you will implement backpropagation and train a multi-layer perceptron (MLP) to classify digit images using the classical MNIST dataset.
[offset] | [type] | [value] | [description] |
---|---|---|---|
0000 | 32 bit integer | 0x00000802 (2050) | magic number (MSB first) |
0004 | 32 bit integer | 60000 (or 10000) | number of samples |
0008 | 32 bit integer | 50 | number of dimensions |
0012 | 64 bit double | ?? | component score (1st component of 1st image projection) |
0020 | 64 bit double | ?? | component score (2nd component of 1st image projection) |
........ | | | |
xxxx | 64 bit double | ?? | component score (last component of last image projection) |
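For concreteness, here is a minimal sketch of how this layout could be parsed in Python. The function name is ours, and we assume the 64-bit doubles are also stored MSB first (big-endian, as the magic number's note suggests) and laid out sample by sample, as the table indicates:

```python
import struct
import numpy as np

def read_projection_file(path):
    """Read the PCA-projection file described in the table above.

    Assumes big-endian byte order throughout ("MSB first"), a 12-byte header
    (magic number, sample count, dimension count), followed by
    n_samples * n_dims 64-bit doubles stored sample by sample.
    """
    with open(path, "rb") as f:
        magic, n_samples, n_dims = struct.unpack(">iii", f.read(12))
        assert magic == 0x00000802, f"unexpected magic number: {magic:#010x}"
        data = np.frombuffer(f.read(n_samples * n_dims * 8), dtype=">f8")
    return data.reshape(n_samples, n_dims).astype(np.float64)
```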
Implement the backpropagation algorithm, as described in Chapter 4 of Mitchell or Chapter 6 of Duda et al.
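As a reference point only, the following is a minimal numpy sketch of one mini-batch gradient step for a 1-hidden-layer network with sigmoid units and squared loss; the variable names, shapes, and the placement of the 1/m factor are our own conventions, not a prescribed implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(X, Y, W1, b1, W2, b2, lr):
    """One gradient step on a mini-batch for a 1-hidden-layer MLP with
    sigmoid units and squared loss.
    X: (batch, d_in) inputs, Y: (batch, d_out) one-hot targets."""
    m = X.shape[0]

    # Forward pass.
    Z1 = X @ W1 + b1          # hidden-layer pre-activations
    A1 = sigmoid(Z1)          # hidden activations
    Z2 = A1 @ W2 + b2         # output-layer pre-activations
    A2 = sigmoid(Z2)          # network outputs

    loss = 0.5 * np.sum((A2 - Y) ** 2) / m

    # Backward pass (chain rule, as in Mitchell Ch. 4).
    dZ2 = (A2 - Y) * A2 * (1 - A2)        # dL/dZ2 (before the 1/m factor)
    dW2 = A1.T @ dZ2 / m
    db2 = dZ2.mean(axis=0)
    dZ1 = (dZ2 @ W2.T) * A1 * (1 - A1)    # dL/dZ1
    dW1 = X.T @ dZ1 / m
    db1 = dZ1.mean(axis=0)

    # Plain gradient-descent update.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
    return loss
```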
Try to find a learning rate (or learning-rate scheme), mini-batch size, and initialization scheme that work best. You may also add a momentum term to your optimization to accelerate it. Additionally, you may add a regularization term to avoid overfitting (or use other methods, such as early stopping). Note that adding a momentum or regularization term should involve only minimal changes to your code, but may significantly improve either the quality of the results or the runtime.
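To illustrate how small the change is, here is a sketch of an SGD update with momentum and L2 regularization ("weight decay"); the hyperparameter values are placeholders, not recommendations, and the function is ours rather than part of any required interface:

```python
import numpy as np

def momentum_update(param, grad, velocity, lr=0.1, mu=0.9, lam=1e-4):
    """SGD update with momentum and L2 regularization.

    Relative to plain gradient descent, only two things change: the L2 term
    lam * param is added to the gradient, and the step is accumulated in a
    velocity buffer instead of being applied directly.
    """
    velocity[:] = mu * velocity - lr * (grad + lam * param)
    param += velocity
    return param, velocity

# Example: W1, v1 = momentum_update(W1, dW1, v1), with v1 initialized to
# np.zeros_like(W1) before training starts.
```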
Note that if you compare your test errors to the neural-network results published on the original dataset's webpage (for comparable network architectures), you should be able to improve upon them significantly.
Use your program to train a neural network with 1 hidden layer and Rectified Linear Units (ReLU), using the squared loss as the loss function at the top layer. That is, build a network whose architecture is the same as before, but replace the sigmoid with the rectifier transfer function, which is defined as $R(x) = \max(0,x)$. Train only one such network, whose number of hidden-layer nodes is the same as that of the best-performing network from the previous part.
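If it helps, here is a sketch of the rectifier and its derivative, together with the one line of the sigmoid backward pass sketched above that changes (the convention of taking the derivative to be 0 at $x = 0$ is ours):

```python
import numpy as np

def relu(z):
    """Rectifier transfer function R(z) = max(0, z)."""
    return np.maximum(0.0, z)

def relu_grad(z):
    """Derivative of the rectifier: 1 where z > 0, 0 elsewhere
    (the value at exactly z = 0 is a convention)."""
    return (z > 0).astype(z.dtype)

# In the backward pass of the sigmoid sketch above, the only change for the
# hidden layer is replacing the sigmoid derivative A1 * (1 - A1) with
# relu_grad(Z1):
#     dZ1 = (dZ2 @ W2.T) * relu_grad(Z1)
```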
1.1 Write a paragraph describing your design choices. In particular, specify all your parameter choices: your learning rate (or learning-rate scheme), mini-batch size, and initialization scheme.
1.2 For each of the 6 networks trained (5 different values of $n_h$ and the ReLU network), plot the squared loss after every half epoch (starting with your initial squared error). Please label your axes in a readable way. Plot both the training and test losses on the same plot (a minimal plotting sketch is given after question 1.5).
1.3 Do the same as 1.2 for the 0/1 loss (i.e. 1 - accuracy), but this time start when the loss is below 7% (or when $\frac{2}{3}$ of the epochs have elapsed, whichever comes first) to make the plot more readable.
1.4 What is your final squared loss and 0/1 loss for both the training and test sets for each network?
1.5 How does using ReLUs compare to using the sigmoid function? Why?
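For questions 1.2 and 1.3, a minimal matplotlib sketch of the required curves is given below; the loss-history arrays and their names are hypothetical, and the half-epoch x-axis follows the prompt:

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_losses(train_losses, test_losses, ylabel, title):
    """Plot training and test loss recorded every half epoch on one figure.

    train_losses / test_losses are equal-length sequences with one entry per
    half epoch, starting with the initial loss (epoch 0.0, 0.5, 1.0, ...).
    """
    half_epochs = 0.5 * np.arange(len(train_losses))
    plt.figure()
    plt.plot(half_epochs, train_losses, label="training")
    plt.plot(half_epochs, test_losses, label="test")
    plt.xlabel("epoch")
    plt.ylabel(ylabel)
    plt.title(title)
    plt.legend()
    plt.show()

# e.g. plot_losses(train_sq, test_sq, "squared loss", "n_h = 64 (hypothetical)")
```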
2.1 Calculate a formula for the error of the ensemble algorithm in terms of $K$ and $\epsilon$, and evaluate it for the case where $K$ = 5, 10 and 20 and $\epsilon$ = 0.1, 0.2 and 0.4.
2.2 If the independence assumption is removed, is it possible for the ensemble error to be