Lecture 1

Reading:
Required: Mitchell Chapter 1
Required: Hulten Chapters 1-3, 16
Required: Hulten, part of Chapter 19, pp. 225-229 (stop at 'distribution of mistakes')

0) Setup

You will need to set up your environment to prepare for later assignments. This involves installing the relevant tools and running the sample code on sample data.

First, get familiar with Python 3.x: https://www.learnpython.org/

No way, Python!?!? Python is widely used in machine learning and data science to process, prepare, and explore data, and to stitch together ML tools into experiments and workflows. Keep in mind that we are only going to be programming very basic versions of the various algorithms, and there will be very little need for optimization. I provide Python framework code to get you started on the assignments, deal with data loading, etc. [ If you really think Python will be a problem for you, let me know. ]

Next, install the tools:
Install Visual Studio Community: https://visualstudio.microsoft.com/thank-you-downloading-visual-studio/?sku=Community&rel=15&rid=34347&utm_source=docs&utm_medium=clickbutton&utm_campaign=python_gettingstarted#
Install it with the Python option (as described here: https://docs.microsoft.com/en-us/visualstudio/python/installing-python-support-in-visual-studio?view=vs-2017).
You are welcome to use a different Python environment, but we haven't tested others and can't offer support for them.

Finally, execute the assignment 1 framework:
Download the framework code from the web page.
Download the SMSSpamCollection dataset from the web page.
Update the kDataPath variable and execute StartingPoint1.py in your Python environment. Make sure you get accuracy reports for the strawman models.
Spend some time reading through the code. Over the next several weeks we'll rewrite and expand most of it.

Hand in: This assignment is ungraded. There is nothing to hand in. But if you don't complete it you're going to have a hard time doing the next assignment.

1) Logistic Regression

Take the heuristic spam model as a starting point (that is, match the general API) and implement logistic regression.

Recall, the loss is the average of:
(-y[i] * math.log(yPredicted[i])) - ((1 - y[i]) * (math.log(1.0 - yPredicted[i])))
(e.g. the validation set loss is the sum of that across the validation data, divided by the number of validation samples)

Recall, yPredicted is:
1.0 / (1.0 + math.exp(-z))

Where z is:
self.weight0 + sum([exampleFeatures[i] * self.weights[i] for i in range(len(exampleFeatures))])

Use a threshold of 0.5 for classification (if the score for a sample after the sigmoid is > 0.5 it is classified as spam).

Use gradient descent for optimization. Recall the gradient for weight j is the sum over samples of:
((yPredicted[i] - y[i]) * x[i][j])
[ Divide this by the number of samples, then multiply by the step size for the update. ]
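If it helps to see the pieces above put together, here is a minimal sketch of batch gradient descent for logistic regression. The class and variable names are illustrative only, not the framework's; match the actual API of the heuristic spam model in the starting point.

import math

class LogisticRegressionModel:
    # A minimal sketch; names are illustrative, not the framework's.
    def __init__(self, featureCount):
        self.weight0 = 0.0
        self.weights = [0.0] * featureCount

    def predictProbability(self, features):
        z = self.weight0 + sum(w * f for w, f in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-z))

    def loss(self, x, y):
        # Average cross-entropy loss over a data set
        total = 0.0
        for features, label in zip(x, y):
            p = self.predictProbability(features)
            total += (-label * math.log(p)) - ((1 - label) * math.log(1.0 - p))
        return total / len(y)

    def fit(self, x, y, iterations=50000, stepSize=0.01):
        n = len(y)
        for _ in range(iterations):
            predictions = [self.predictProbability(features) for features in x]
            errors = [p - label for p, label in zip(predictions, y)]
            # Gradient for the bias term uses a constant input of 1
            self.weight0 -= stepSize * (sum(errors) / n)
            for j in range(len(self.weights)):
                gradient = sum(errors[i] * x[i][j] for i in range(n)) / n
                self.weights[j] -= stepSize * gradient

    def predict(self, x, threshold=0.5):
        return [1 if self.predictProbability(features) > threshold else 0 for features in x]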
HAND IN: Run your implementation on the training/validation data in the framework and hand in a clear document containing the following. Include your logistic regression code as an appendix (no points for the code, but you can lose points for not including it).

2 Points -- Run for 50,000 iterations with step size 0.01 and produce a graph of the training set loss vs. iteration, sampled every 1000 iterations. (This may take a few minutes to run. Machine learning takes some patience.)
2 Points -- Plot the validation set loss, validation set accuracy, and value of weight[1] (this is the weight associated with X_1) after every 10,000 iterations.
1 Point -- Calculate all the statistics from the evaluation framework on the 50,000 iteration run, including the confusion matrix, precision, recall, etc.

Answer in no more than 150 words (plus the plots and tables mentioned above):
1 Point -- What do these measurements tell you about logistic regression compared to the straw-man algorithms?
1 Point -- How did the gradient descent converge?
1 Point -- What makes you think you implemented logistic regression correctly?

2) Basic Model Evaluation

Finish implementing the EvaluationsStub methods that you will find in the EvaluationsStub.py code provided in the sample code. These are methods of the form:
Precision(y, yPredicted) -> float

The full list of evaluations to implement includes:
Precision
Recall
False positive rate
False negative rate

Also implement code to visualize the complete confusion matrix (simple ASCII art is fine). You can find the definitions of all of these in the reading from Hulten chapter 19.

Hand in a document containing:
0.5 Points -- Your code. Keep it clear! If the TA can't easily follow it they will have to deduct credit.
0.5 Points -- A table showing the output of all of these evaluation methods for the spam domain with the most common model and the spam heuristic model (the two models in the starting point I provided).

Lecture 2

Reading:
Required: Hulten Chapters 6, 11, 12, 17, 19 (finish 19)

3) Feature Engineering

Add bag of words features to your spam domain solution:
Support frequency based feature selection (top N).
Support mutual information based feature selection (top N).
Tokenize in the simplest way possible (by splitting on whitespace).

Recall:
MutualInformation(X, Y) = sum over every value X has and Y has of:
P(has X, has Y) * log_e( P(has X, has Y) / (P(has X) * P(has Y)) )

And use smoothing when calculating the probabilities:
P(*) = (# observed + 1) / (total samples + 2)

Do all the runs listed below on the train/validation split provided by the framework.

HAND IN: A document that contains the following tables (clearly labeled!)

1 point -- Perform a leave-one-out wrapper search on each of the 5 features provided by the starting framework (50,000 iterations; there are 5 features; you can do this evaluation manually or programmatically). (> 40, contains #, contains 'call', contains 'to', contains 'your'). Hand in a table showing the accuracy on the validation set with each feature left out, compared to a model built with all of the features.
0.5 point -- A list of the top 10 bag of word features selected by filtering by frequency.
0.5 point -- A list of the top 10 bag of word features selected by filtering by mutual information.
2 points -- Run gradient descent to 50,000 iterations with each of the following feature sets (0.5 point each):
* the top 10 words by frequency
* the top 10 words by mutual information
* the better of these PLUS the hand crafted features from the framework
* the previous setting with 100 words plus hand-crafted instead of 10
Hand in a clearly labeled table comparing the accuracies of these methods.

4) ROC Curves and Operating Points

Update model.predict for your logistic regression model so that it takes a threshold as a parameter (as opposed to the default of 0.5 we've been using so far).
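Here is a minimal sketch of how a thresholded predict can be swept to collect precision/recall operating points. The names model, xValidate, and yValidate are placeholders for whatever your framework provides, and EvaluationsStub is assumed to expose Precision and Recall the way you implemented them in the previous assignment (adjust if yours are class methods).

import EvaluationsStub

def sweepThresholds(model, xValidate, yValidate, steps=100):
    # Collect (threshold, precision, recall) operating points on the validation set.
    # You may need to handle thresholds where no samples are predicted positive
    # (precision is undefined there) inside your Precision implementation.
    operatingPoints = []
    for i in range(1, steps + 1):
        threshold = i / steps
        yPredicted = model.predict(xValidate, threshold=threshold)
        operatingPoints.append((threshold,
                                EvaluationsStub.Precision(yValidate, yPredicted),
                                EvaluationsStub.Recall(yValidate, yPredicted)))
    return operatingPoints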
1 point -- Produce a chart that compares:
* a logistic regression model with 10 mutual information features
* a logistic regression model with 10 mutual information features plus my heuristic features
by plotting their precision vs. their recall at 100 different thresholds (0.01, 0.02, ..., 1.00). You can use a python plotting library (like matplotlib) or import the data into some other tool to produce the plot (like Excel). Get your predictions by training on the training set the framework provides and evaluating on the validation set.
1 point -- Hand in a table that contains the threshold that achieves a 10% False Positive Rate on the validation set for these two models (10 mutual information features with and without the heuristics). Include the False Negative Rate that is achieved on the validation set at that threshold.

5) Categorizing Mistakes

Looking at and interpreting the mistakes your models make is an important part of successful machine learning, particularly when doing feature engineering.

Implement a way to get the raw context for the samples where your model is most wrong. Recall this includes examples where the true answer was 1 but the model gives very low probabilities, and examples where the true answer was 0 but the model gives very high probabilities. To do this you will need a version of model.predict that returns raw probabilities (without using a threshold).

HAND IN:
0.5 point -- Produce a list of the 20 worst false positives (where the model was very sure but wrong), or as many as your model makes, made by running logistic regression on the initial train/validation split with 10 mutual information features.
0.5 point -- Produce a list of the 20 worst false negatives (where the model was very sure but wrong) made by running logistic regression on the initial train/validation split with 10 mutual information features.

Look at these mistakes and categorize them according to potential causes -- the properties of the messages that you think led the model to the wrong answer. Come up with your own categories (reasons for the mistake):
0.5 point -- Categorize the false positives into at least 4 categories.
0.5 point -- Categorize the false negatives into at least 4 categories.
1 point -- In no more than 150 words describe the insight you got from this process, including one new heuristic feature you think would reduce the bad false positives, and one that would reduce the bad false negatives.

Lecture 3

Reading:
Required:
* Mitchell Online Addendum 3 (not chapter 3 in the book)
* Mitchell Chapter 5 (in the book)
Optional:
* Kohavi 95 (http://web.cs.iastate.edu/~jtian/cs573/Papers/Kohavi-IJCAI-95.pdf)

6) Comparing Models

Implement code to estimate the 95% range for your accuracy estimates. In the future, always include 95% confidence ranges whenever you turn in accuracy estimates. Recall:
Upper = Accuracy + 1.96 * sqrt((Accuracy * (1 - Accuracy)) / n)
Lower = Accuracy - 1.96 * sqrt((Accuracy * (1 - Accuracy)) / n)

Evaluate two settings from your feature selection assignment: 10 words selected by mutual information and 10 words selected by frequency.

Implement cross validation that supports any number of folds from 2 to n. Verify that it is selecting the correct data into each fold. Be prepared to run all current evaluations on the result (precision, recall, false positive rate, false negative rate).
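For reference, here is a minimal sketch of the confidence range and one simple way to build the folds. It assumes x and y are parallel Python lists; the function names are illustrative, and you may want to shuffle the data before splitting if it is ordered.

import math

def accuracyBound(accuracy, n, zScore=1.96):
    # 95% confidence range for an accuracy estimate made on n samples
    delta = zScore * math.sqrt((accuracy * (1 - accuracy)) / n)
    return (accuracy - delta, accuracy + delta)

def crossValidationFolds(x, y, numFolds):
    # Split the data into numFolds contiguous folds; each fold is used once as the
    # evaluation set, with the remaining samples used for training.
    folds = []
    n = len(y)
    for k in range(numFolds):
        start, end = (k * n) // numFolds, ((k + 1) * n) // numFolds
        xEvaluate, yEvaluate = x[start:end], y[start:end]
        xTrain, yTrain = x[:start] + x[end:], y[:start] + y[end:]
        folds.append((xTrain, yTrain, xEvaluate, yEvaluate))
    return folds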
Hand in a clearly labeled table with:
* 1 point -- the accuracy estimates from the train/validation split run, with error bounds, for the two models (10 words by frequency and 10 words by mutual information)
* 1 point -- the accuracy estimates from a 5-fold cross validation run, with error bounds, for the two model variants (10 words by frequency and 10 words by mutual information; cross validation on the data in the train+validation set)

NOTE: If your gradient descent is slow (like mine is), these runs are going to start to take a long time. One possibility is to just not care -- let it run overnight or whatever. Another easy approach is to do several runs in parallel. You can do this manually or programmatically. You could also choose to optimize your code, but don't go overboard. That's not the point of this assignment. In practice you should use an existing, highly-optimized implementation of a machine learning algorithm (not one you write yourself).

7) Build your Best SMS Spam Model

This is a competition assignment. See README_Test. Download the kaggle support code that will load the test data and format your submission.

Build the best model you can using any of the learning algorithms you developed so far. Do any feature engineering you like. Submit your answers to our competition, check the leaderboard, and iterate according to the submission rules.

5 Points -- For submitting an answer that performs better than the TA's baseline answer.
3 Points -- For submitting an answer that performs better than the TA's tuned answer.
5 Points -- Hand In: A report of no more than 1000 words, with no more than 5 figures/tables, which demonstrates that you have produced an effective model for the SMS spam problem and properly evaluated your model.
1 Point -- Demonstrate several parameter sweeps in figures.
1 Point -- Make sure to use the appropriate techniques from the previous lectures (error bounds, various measures of model quality, categorized mistakes); describe how you used one of them and how it helped.
1 Point -- Examine your mistakes and improve your feature engineering in at least 3 ways. Describe the features your best model uses (you can use insights from your previous investigation, but indicate which you used).
1 Point -- Include an ROC curve comparing the first model you tried with your final best model. Clearly label what they are.
1 Point -- Clearly describe your best model and the parameter settings you used, as well as the complete final feature set.

NOTE: Trying many parameters can take a lot of time. You can run different parameter settings in parallel using joblib:

from joblib import Parallel, delayed

# Running the parameter evaluations serially
paramsEvals = [ ExecuteOnParams(params) for params in paramsList ]

# Running the parameter evaluations in parallel
paramsEvals = Parallel(n_jobs=8)(delayed(ExecuteOnParams)(params) for params in paramsList)

# In my implementation params is a dictionary of all the (hyper) parameters to use for feature selection
# and model training, and ExecuteOnParams runs selection and fitting using those parameters,
# returning an updated params dictionary that includes the validation set loss from the run.

Lecture 4

Reading:
* Required: Hulten Chapters 4, 5, 7, 8
* Mitchell Chapter 3 (in the book)

8) Decision Trees

Download the support framework for the Adult Census dataset.
Download the Adult Census dataset.
Implement a decision tree learner:

import DecisionTreeModel

model = DecisionTreeModel.DecisionTree()
model.fit(x, y)
yValidatePredicted = model.predict(xValidate)

Implement (very simple) numeric attributes by measuring the range of the feature values at each node, then consider splitting at the midpoint. E.g. for our binary 0/1 features, where every value is 0 if the feature is absent and 1 if the feature is present, consider a split at the midpoint of 0.5. Recall: if all samples at a node have the same value for a feature, there is no need to consider splitting on it.

Use greedy search (one step look ahead: just consider a single split at a time). Use InformationGain as the split criterion. Recall Information Gain is:

H(node_data) - sum_i p(feature has value i) * H(node_data where feature has value i)

And H is:

sum_y - p(node_data has label y) * log(p(node_data has label y))

(A minimal sketch of these computations appears at the end of this assignment.)

Implement a single search parameter: minToSplit, which stops the tree growth process when a new leaf has fewer samples than the limit.

Implement a function to visualize the tree:

model.visualize()

feature i:
  >= 0.5:
    Leaf:
  < 0.5:
    feature j:
      >= 0.5:
        Etc...
      < 0.5:
        Etc...

HAND IN: Run your DecisionTree algorithm on the Adult Census dataset (with the train/validate/test split provided in the sample framework) using the features that were included with the support framework and with minToSplit = 10,000.

1 Point -- Report the accuracy on the validation set. Include an error bound.
1 Point -- Include a visualization of the tree.

Now tune the minToSplit parameter to try to find the best setting.

3 Points -- In no more than 200 words describe the process you used to optimize minToSplit, including a chart showing the minToSplit values you tried on the x-axis and validation set accuracy on the y-axis. How sensitive is the accuracy to the value of minToSplit? Which value turned out to be best? Is it significantly better than the initial setting (10,000) at a 95% confidence threshold? How do you know?
2 Points -- Produce an ROC curve comparing the tree learned with minToSplit = 10,000 to the tree you learned with your best setting. Clearly label which curve corresponds to which model. Describe what you are able to learn from looking at the curves.

NOTE: To produce an ROC curve you need to use a threshold for predicting 1/0 based on the fraction of samples at the leaf with the predicted label.

Upload your code in .py files. Upload all your answers in a single PDF file. No archives/zips.
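Here is a minimal sketch of the entropy and information-gain computations described above, for the binary split-at-the-midpoint case. The helper names are illustrative and not part of the support framework.

import math

def entropy(y):
    # H = sum over labels of -p(label) * log(p(label))
    counts = {}
    for label in y:
        counts[label] = counts.get(label, 0) + 1
    total = len(y)
    return sum(-(c / total) * math.log(c / total) for c in counts.values())

def informationGain(x, y, featureIndex, threshold=0.5):
    # Gain from splitting this node's data on one feature at the given threshold
    yBelow = [label for features, label in zip(x, y) if features[featureIndex] < threshold]
    yAbove = [label for features, label in zip(x, y) if features[featureIndex] >= threshold]
    remainder = ((len(yBelow) / len(y)) * entropy(yBelow)
                 + (len(yAbove) / len(y)) * entropy(yAbove))
    return entropy(y) - remainder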
Lecture 5

Reading:
* Required: Sections 1, 3, and 4 of the original random forest paper: https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf
* Required: Hulten Chapters 9, 20

9) Random Forest

Implement random forests, using your existing decision tree model as the base model.

NumTrees & MinSplit -- Support parameters for the number of trees and the minSplit for the base trees.
Bagging -- Support a flag to use bagging to select the data for each tree; when absent you should learn each tree on the full original sample.
FeatureRestriction -- If specified, randomly select N features for each tree and restrict the tree to using those (select a different random set for each tree). If set to 0, use all available features.
Seed -- Support a random seed so your runs (sampling features and training samples) are deterministic. This is purely to help you debug, and it will almost certainly help.

Recall, bagging means creating a new training set by randomly selecting samples from the original training set with replacement. If the original is N samples, select N random indexes into the sample array (with replacement) and build a new set using the samples at those indexes.

HAND IN: Use the Adult Census data for these assignments.

Build a model with numTrees = 10, minSplit = 500, Bagging, and FeatureRestriction = 75.

2 Points -- Create a table that has the accuracies of the 10 individual trees, along with the accuracy of the full random forest (after the individual trees vote), on xTest.

Run parameter sweeps for numTrees in [1, 20, 40, 60, 80], for each of these settings:
* minSplit = 2, Bagging, and FeatureRestriction = 75
* minSplit = 50, Bagging, and FeatureRestriction = 75
* minSplit = 2, NO Bagging, and FeatureRestriction = 75
* minSplit = 2, Bagging, and NO FeatureRestriction (= 0)

2 Points -- Produce a plot with numTrees on the x-axis and the hold-out set accuracy for each of these variations on the y-axis.

NOTE: These runs can be very slow, but the trees can be built in parallel. joblib is a python library that makes this easy to do. Keep in mind, if you're doing parallel hyperparameter search, you probably won't want to also parallelize tree growing inside the random forest algorithm:

# Sequential version:
self.trees = [ GrowTree(i, x, y, minToSplit, random.randint(0, 1000000), logProgress, useBagging, restrictFeatures, numFeaturesPerTree) for i in range(numTrees) ]

# Parallel version:
self.trees = Parallel(n_jobs=6)(delayed(GrowTree)(i, x, y, minToSplit, random.randint(0, 1000000), logProgress, useBagging, restrictFeatures, numFeaturesPerTree) for i in range(numTrees))

Lecture 6

Reading:
* Mitchell Chapter 8
* Hulten Chapter 25

10) Computer Vision Features

Download the new data set and support code. You'll also need to install the Pillow imaging library to process image files.

For this assignment add 4 new feature sets to BlinkSupport.Featurize. Build a model with random forests and tune the parameters (try at least 3 settings each of: min to split, num trees, and feature restriction).

0.5 Point -- Divide the image into a 3 x 3 grid and for each grid location include a feature for the min, max, and average y-gradient among the locations in the grid. What test-set accuracy did you achieve? What parameter values were best?
0.5 Point -- Divide the image into a 3 x 3 grid and for each grid location include a feature for the min, max, and average x-gradient among the locations in the grid. What test-set accuracy did you achieve? What parameter values were best?
0.5 Point -- Implement a histogram of gradients across the whole image (not on the grid) with 5 uniformly spaced bins for the absolute value of the y-gradients (0 - 0.2, 0.2 - 0.4, etc.). For each bin create a feature whose value is the percent of y-gradients that fall in the bin. What test-set accuracy did you achieve? What parameter values were best?
0.5 Point -- Implement a histogram of gradients across the whole image (not on the grid) with 5 uniformly spaced bins for the absolute value of the x-gradients (0 - 0.2, 0.2 - 0.4, etc.). For each bin create a feature whose value is the percent of x-gradients that fall in the bin. What test-set accuracy did you achieve? What parameter values were best?
1 Point -- Produce an ROC curve with one curve for each of: the y-gradients on the 3x3 grid; the x-gradients on the 3x3 grid; the y-gradient histogram; the x-gradient histogram. Use the tuning values you found in the previous parts of this assignment.
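As a reference point, here is one illustrative way the 3 x 3 grid y-gradient features described above could be computed with Pillow. It uses a simple finite-difference gradient on normalized grayscale intensities; BlinkSupport may represent pixels or compute gradients differently, so adapt the indexing to your Featurize code (the function and variable names are placeholders).

from PIL import Image

def gridYGradientFeatures(imagePath, gridSize=3):
    image = Image.open(imagePath).convert('L')  # grayscale
    width, height = image.size
    pixels = image.load()
    features = []
    for gridY in range(gridSize):
        for gridX in range(gridSize):
            xRange = range((gridX * width) // gridSize, ((gridX + 1) * width) // gridSize)
            yRange = range((gridY * height) // gridSize, ((gridY + 1) * height) // gridSize)
            # Finite-difference y-gradient, with intensities normalized to [0, 1]
            gradients = [(pixels[x, y + 1] - pixels[x, y]) / 255.0
                         for x in xRange for y in yRange if y + 1 < height]
            features += [min(gradients), max(gradients), sum(gradients) / len(gradients)]
    return features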
11) k-Means Clustering

Implement kMeans clustering.

Select random samples to use as the initial cluster centroids. For each iteration:
* Assign each training sample to the nearest cluster centroid.
* Update each centroid location to the average location of its assigned samples.

Use Euclidean distance (the L2 norm). Recall this is:
sqrt( sum over each dimension d of: (sample_d - centroid_d)^2 )

Recall the new centroid location for dimension d is:
new_d = ( sum over each assigned sample of: sample_d ) / ( # assigned samples )

Support a parameter to specify how many clusters to learn. Support a parameter to specify how many iterations to run for.

HAND IN: Run a clustering on the training set of the eye blink dataset (xTrain) with the two features provided in the support code (avg y-gradient, and avg y-gradient mid image). Use 4 clusters for 10 iterations.

1 Point -- Produce a plot showing the training data and overlaying the paths the cluster centroids take over each of the 10 iterations.
1 Point -- Find the closest training sample to the final location of each cluster center and the associated image.
1 Point -- In no more than 150 words describe the clustering process. Did the clustering converge? What do you learn from examining the images nearest the cluster centers?

12) K Nearest Neighbors

Implement k nearest neighbors, where each test sample is classified based on its k nearest neighbors among the training set; the test sample's score is the proportion of the k neighbors whose label is 1.

Use the blink dataset with the features you implemented in the previous assignment (the 3x3 grids of x-gradients & y-gradients and the histograms across the whole image).

2 Points -- Evaluate k in [1, 3, 5, 10, 20, 50, 100]. Produce an ROC curve comparing each of these approaches (fit the model on xTrain, evaluate on xValidate).
1 Point -- In no more than 100 words describe the results. Which k is best?

Lecture 7

Reading:
* Mitchell Chapter 4
* Hulten Chapters 10, 21

13) Neural Networks

Implement learning of fully connected neural networks with an input layer, N hidden layers (with M nodes per hidden layer), and an output layer (with one output variable), using the Backpropagation algorithm as described in Mitchell with stochastic gradient descent. The implementation should be roughly:

For each iteration:
  For each training sample:
    Pass the sample through the network to get activations
    Propagate the error from the output layer back through the network
    Update all the weights

[ Hint: have one structure for all the weights in the network, a parallel one for the activations, and a parallel one for the errors. ]
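To make the outline above concrete, here is a minimal sketch of a per-sample update for sigmoid units following Mitchell's backpropagation rules. The class name, weight layout, and method names are illustrative only; the outer iteration loop, loss tracking, and prediction threshold are left to your implementation.

import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class SimpleNeuralNet:
    # layerSizes is [numInputs, hidden1, ..., hiddenN, 1]
    def __init__(self, layerSizes, seed=0):
        random.seed(seed)
        # weights[l][j][i+1] connects node i in layer l to node j in layer l+1;
        # index 0 in each row is the bias weight.
        self.weights = [[[random.uniform(-0.05, 0.05) for _ in range(layerSizes[l] + 1)]
                         for _ in range(layerSizes[l + 1])]
                        for l in range(len(layerSizes) - 1)]

    def _forward(self, x):
        activations = [list(x)]
        for layer in self.weights:
            prev = activations[-1]
            activations.append([sigmoid(w[0] + sum(wi * ai for wi, ai in zip(w[1:], prev)))
                                for w in layer])
        return activations

    def predictProbability(self, x):
        return self._forward(x)[-1][0]

    def fitOneSample(self, x, y, stepSize=0.05):
        activations = self._forward(x)
        # Output-layer error term: delta = o * (1 - o) * (t - o)
        deltas = [[o * (1.0 - o) * (y - o) for o in activations[-1]]]
        # Hidden-layer error terms, propagated backward through the network
        for l in range(len(self.weights) - 1, 0, -1):
            layerDeltas = []
            for i, o in enumerate(activations[l]):
                downstream = sum(self.weights[l][j][i + 1] * deltas[0][j]
                                 for j in range(len(self.weights[l])))
                layerDeltas.append(o * (1.0 - o) * downstream)
            deltas.insert(0, layerDeltas)
        # Weight updates: w_ji += stepSize * delta_j * x_ji (bias input is 1)
        for l, layer in enumerate(self.weights):
            for j, w in enumerate(layer):
                w[0] += stepSize * deltas[l][j]
                for i, a in enumerate(activations[l]):
                    w[i + 1] += stepSize * deltas[l][j] * a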
Use BlinkSupport's Featurize with the following arguments:
includeGradients=False
includeRawPixels=False
includeIntensities=True

This results in features where each image is scaled down by a factor of 2 and each pixel intensity is converted to the range 0.0 - 1.0.

Train models with every combination of hidden layers in [1, 2] and hidden nodes per layer in [2, 5, 10, 15, 20]. For each, use 200 iterations with step size = 0.05. NOTE: This will take a while to complete; you might want to verify all your parameters are working on smaller experiments before kicking off the full run.

Produce a plot with one line for each of these runs, with the iteration number on the x-axis and the training set loss on the y-axis. Use Equation 4.2 in Mitchell for the loss (Mean Squared Error):

E = 1/2 * sum over outputs of (y - yPredicted)^2

where E is the error on a single sample, and MSE is the average of this across the whole training set.

Produce a separate plot with one line for each of these runs but with the validation set losses on the y-axis.

Next take the model with 1 hidden layer and 2 nodes in that layer. Visualize the weights for each of the hidden nodes. There should be 12*12 weights (plus one for the bias, which you can ignore). Convert them to a 12 x 12 image where each pixel intensity is ~ 255 * abs(weight). Note: the VisualizeWeights function in BlinkSupport can do this for you.

Hand in a writeup including:
* 1 Point -- The two charts (all your training runs showing the train and validation loss).
* 1 Point -- The two visualizations (the weights for the two hidden nodes).
* 1 Point -- The best parameters and test-set accuracy you found in your parameter sweep.
* And in no more than 150 words (and 1 more chart/figure if it will help) answer these questions:
  o 1 Point -- Did you observe overfitting and underfitting? Where?
  o 1 Point -- What do the visualizations mean?

14) Warm up Model Tuning

(Don't go too far, just warm up; you'll do more tuning with more powerful approaches in the Kaggle assignment.)

Now tune neural networks to produce the best model you can. Use just the training data (xTrain) and validation data (xValidate) to tune your parameters (using xValidate as holdout data, or by combining xTrain and xValidate and doing cross-validation); reserve the test set (xTest) for the final accuracy measurement.

1 Point -- Change the features in at least one way [ increase or decrease the image scaling provided by the sample, change the normalization, or add momentum to your back propagation ]. Include a table showing the results and a few sentences describing if and how it helped.
2 Points -- Use 2-3 tables and not more than 200 words to describe the parameter tuning you did. Describe one place where you examined the output of the modeling process and used the insight to improve your modeling process. What was this output? What change did you make because of it?
1 Point -- Include an ROC curve comparing the best random forest model you got on hand-crafted features (last assignment), your initial neural network (before any tuning), and your final resulting network (after changing a feature and tuning).

Lecture 8

Reading:
* Hulten Chapters 13, 14, 15
* The ResNet Paper (You may not follow it all; that's fine.)

15) Build your best Eye Blink Model

This is a Kaggle competition assignment, see here: https://www.kaggle.com/c/csep546-aut19-kc2/overview

Redownload the BlinkSupport.zip code. You'll find:
* BlinkKaggleSupport.py -- basic help for loading/submitting
* Pytorch Set up on VS.txt -- a quick walkthrough on getting set up with Visual Studio; it should help with other environments too
* BestEyeBlinkModel.ipynb -- a sample of how to set up on Google's Colab service, where you can get free GPU access. Open the file in Google Colab. If you use this, send Andrew a thank you note.

You can also make your own way, starting here: https://pytorch.org/get-started/locally/

Build the best model you can using any of the learning algorithms you developed so far (but probably neural networks). Do any feature and network engineering you like. But consider updating the starter model in SimpleBlinkNeuralNet.py based roughly on LeNet-5 (see the sketch below) and then tuning from there.
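A hedged sketch of what a small LeNet-5-style starting point might look like in PyTorch. It assumes 24 x 24 single-channel inputs and a single sigmoid output; check SimpleBlinkNeuralNet.py and BlinkKaggleSupport.py for the input size and output convention they actually use, and treat the layer sizes as tuning starting points rather than recommendations.

import torch

class BlinkConvNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.features = torch.nn.Sequential(
            torch.nn.Conv2d(1, 6, kernel_size=5),   # assumed 1x24x24 input -> 6x20x20
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(2),                  # -> 6x10x10
            torch.nn.Conv2d(6, 16, kernel_size=5),  # -> 16x6x6
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(2))                  # -> 16x3x3
        self.classifier = torch.nn.Sequential(
            torch.nn.Flatten(),
            torch.nn.Linear(16 * 3 * 3, 64),
            torch.nn.ReLU(),
            torch.nn.Dropout(0.25),
            torch.nn.Linear(64, 1),
            torch.nn.Sigmoid())

    def forward(self, x):
        return self.classifier(self.features(x))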
5 Points -- For submitting an answer that performs better than the first baseline answer (which will be submitted by the TA).
5 Points -- For submitting an answer that performs better than the second baseline (which will be enabled by the TA after a week).

Hand In: A report of no more than 1000 words, with no more than 5 figures/tables, which demonstrates that you have produced an effective model for the Blink problem and properly evaluated your model.

1 Point -- Demonstrate several parameter sweeps in your figures and highlight at least one area that indicates underfitting and one that indicates overfitting.
1 Point -- Make sure to use the appropriate techniques from the previous lectures (error bounds, various measures of model quality, categorized mistakes). Demonstrate you used them in your modeling.
1 Point -- Examine your mistakes and improve your feature/network engineering in at least 3 ways. Describe the features you tried and how they affected accuracy.

Use whatever you need to produce an accurate model, but consider:
* torch.nn.Conv2d
* torch.nn.ReLU
* torch.nn.MaxPool2d
* torch.nn.BatchNorm2d
* torch.nn.Dropout
* Tuning the number of filters/nodes, the optimizer, and the training iterations
* Data augmentation (e.g. mirroring all the training data: make a left eye look like a right eye and vice versa)

1 Point -- Include an ROC curve comparing the first model you tried with your final best model. Clearly label what they are.
1 Point -- Clearly describe your best model and the parameter settings you used, as well as the complete final feature set. Include the updated neural network code so we can see the model being built and the forward pass.

Lecture 9

Reading:
* Mitchell Chapter 13
* Watch the AlphaZero talk: https://www.youtube.com/watch?v=Wujy7OzvdJk&t=2057s
* Hulten Chapters 22, 23, 24

16) Reinforcement Learning

Setup: Install the gym toolkit for evaluating reinforcement learning algorithms: https://gym.openai.com/docs/
Download the support code.

Implement Q-learning as described in Mitchell chapter 13. Represent Q^ with tables. Use formulas 13.10 and 13.11 to update Q^.

Support an experimentation strategy that:
* Uses the formula in section 13.3.5 to decide which action to take, P(a_i | s). Support the parameter k to modify this expression (k = e is a good start; also consider values in the range 1.01 - 1.5).
* Adds an additional parameter called randomActionRate, which overrides this and takes a totally random action (e.g. if this is 0.02 then take a random exploration action 2% of the time).
* Also supports a learningMode=False option that unconditionally takes the best action, so you can run your best policy against the simulator and see how well you've learned.

In summary, you should support these parameters:
* discountRate = 1.0             # Controls the discount rate for future rewards -- this is gamma from 13.10
* randomActionRate = 0.01        # Percent of the time the next action selected by GetAction is totally random
* actionProbabilityBase = 2.7    # This is k from the P(a_i | s) expression in section 13.3.5 and influences how random the exploration is
* learningRateScale = 0.01       # Should be multiplied by visits_n from 13.11
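Here is a minimal sketch of a tabular Q-learner using the parameters above. It assumes the provided support code handles state discretization (binning) and the gym interaction loop; the class and method names (GetAction, ObserveAction) are illustrative assumptions, so match whatever the starting point files actually call.

import random

class QLearner:
    def __init__(self, numActions, discountRate=1.0, actionProbabilityBase=2.7,
                 randomActionRate=0.01, learningRateScale=0.01):
        self.numActions = numActions
        self.discountRate = discountRate
        self.actionProbabilityBase = actionProbabilityBase
        self.randomActionRate = randomActionRate
        self.learningRateScale = learningRateScale
        self.qTable = {}   # (state, action) -> Q^ estimate
        self.visits = {}   # (state, action) -> update count

    def _q(self, state, action):
        return self.qTable.get((state, action), 0.0)

    def GetAction(self, state, learningMode=True):
        if not learningMode:
            return max(range(self.numActions), key=lambda a: self._q(state, a))
        if random.random() < self.randomActionRate:
            return random.randrange(self.numActions)
        # P(a_i | s) proportional to k^Q^(s, a_i)  (section 13.3.5)
        weights = [self.actionProbabilityBase ** self._q(state, a) for a in range(self.numActions)]
        return random.choices(range(self.numActions), weights=weights)[0]

    def ObserveAction(self, oldState, action, newState, reward):
        self.visits[(oldState, action)] = self.visits.get((oldState, action), 0) + 1
        # alpha from 13.11, with the visit count scaled by learningRateScale
        alpha = 1.0 / (1.0 + self.learningRateScale * self.visits[(oldState, action)])
        best = max(self._q(newState, a) for a in range(self.numActions))
        self.qTable[(oldState, action)] = ((1.0 - alpha) * self._q(oldState, action)
                                           + alpha * (reward + self.discountRate * best))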
Use StartingPoint-QLearningCartPole.py.

3 Points -- Run the starting point code 10 times with your implementation (learn the policy, then evaluate it as in the starting point). Hand in a table of the scores you achieved on each of the 10 iterations as well as the average. [ Many runs should score 200.0. ]

Now move on to the StartingPoint-QLearningMountainCar.py file. If your implementation is correct, the default parameters should come close to getting the car up the hill (but not really succeed).

4 Points (1 point per parameter) -- Tune the following 4 parameters:
* discountRate
* actionProbabilityBase
* trainingIterations
* GymSupport.mountainCarBinsPerDimension

For each, produce a chart with at least 5 settings of the parameter value on the x-axis and the average score across 10 policy learning runs on the y-axis (10 times: learn a policy, then evaluate it as the sample code does; report the average score). Consider the properties of this problem and use your understanding to guide which regions you explore. For each parameter include 2-3 sentences about what the results of the tuning tell you about the problem.

2 Points -- Produce an improved parameter setting. You may change the 4 parameters you tuned, and you may change any other parameter that you think matters (and do additional tuning). Run your new parameter settings 10 times (learn the policy, then evaluate it as in the starting point). Hand in a table of the scores you achieved on each of the 10 policy learning runs as well as the average.

Lecture 10

Reading:
* Mitchell Chapter 6
* Mitchell online addendum (not in book) Chapter 14
* Hulten Chapters 18, 26