Project 5: Classification

Version 1.004. Last Updated: 11/19/2018.

Table of Contents

Which Digit?


In this project, you will design a classifier (a perceptron classifier). You will test the classifier on a set of scanned handwritten digit images. Even with simple features, your classifier will be able to do quite well on these tasks when given enough training data.

Optical character recognition (OCR) is the task of extracting text from image sources. The data set on which you will run your classifiers is a collection of handwritten numerical digits (0-9). This is a very commercially useful technology, similar to the technique used by the US post office to route mail by zip codes. There are systems that can perform with over 99% classification accuracy (see LeNet-5 for an example system in action).

The code for this project includes the following files and data, available as a zip file.

Data file Data file, including the digit and face data.
Files you will edit The location where you will write your perceptron classifier. The wrapper code that will call your classifiers. You will also write your enhanced feature extractor here. You will also use this code to analyze the behavior of your classifier. Answer to Question 2 goes here.
Files you should read but NOT edit Abstract super class for the classifiers you will write.
(You should read this file carefully to see how the infrastructure is set up.) I/O code to read in the classification data. Code defining some useful tools. You may be familiar with some of these by now, and they will save you a lot of time. A simple baseline classifier that just labels every instance as the most frequent class.

Evaluation: Your code will be autograded for technical correctness. Please do not change the names of any provided functions or classes within the code, or you will wreak havoc on the autograder. However, the correctness of your implementation -- not the autograder's judgements -- will be the final judge of your score. If necessary, we will review and grade assignments individually to ensure that you receive due credit for your work.

Academic Dishonesty: We will be checking your code against other submissions in the class for logical redundancy. If you copy someone else's code and submit it with minor changes, we will know. These cheat detectors are quite hard to fool, so please don't try. We trust you all to submit your own work only; please don't let us down. If you do, we will pursue the strongest consequences available to us.

Getting Help: You are not alone! If you find yourself stuck on something, contact the course staff for help. Office hours, section, and the discussion forum are there for your support; please use them. If you can't make our office hours, let us know and we will schedule more. We want these projects to be rewarding and instructional, not frustrating and demoralizing. But, we don't know when or how to help unless you ask.

Discussion: Please be careful not to post spoilers.

Question 1 (4 points): Perceptron

A skeleton implementation of a perceptron classifier is provided for you in In this part, you will fill in the train function.

Unlike the naive Bayes classifier, a perceptron does not use probabilities to make its decisions. Instead, it keeps a weight vector \(w^y\) of each class \(y\) ( \(y\) is an identifier, not an exponent). Given a feature list \(f\), the perceptron computes the class \(y\) whose weight vector is most similar to the input vector \(f\). Formally, given a feature vector \(f\) (in our case, a map from pixel locations to indicators of whether they are on), we score each class with:

\( \mbox{score}(f,y) = \sum\limits_{i} f_i w^y_i \)

[Note, if you don't see a mathematical expression just above this line, make sure your browser is permitted to load JavaScript libraries from 3rd party sites such as]

Then we choose the class with highest score as the predicted label for that data instance. In the code, we will represent \(w^y\) as a Counter.

Learning weights

In the basic multi-class perceptron, we scan over the data, one instance at a time. When we come to an instance \((f, y)\), we find the label with highest score:

\( y' = arg \max\limits_{y''} score(f,y'') \)

We compare \(y'\) to the true label \(y\). If \(y' = y\), we've gotten the instance correct, and we do nothing. Otherwise, we guessed \(y'\) but we should have guessed \(y\). That means that \(w^y\) should have scored \(f\) higher, and \(w^{y'}\) should have scored \(f\) lower, in order to prevent this error in the future. We update these two weight vectors accordingly:

\( w^y = w^y + f \)

\( w^{y'} = w^{y'} - f \)

Using the addition, subtraction, and multiplication functionality of the Counter class in, the perceptron updates should be relatively easy to code. Certain implementation issues have been taken care of for you in, such as handling iterations over the training data and ordering the update trials. Furthermore, the code sets up the weights data structure for you. Each legal label needs its own Counter full of weights.


Fill in the train method in Run your code with:

python -c perceptron 

Hints and observations:

Question 2 (1 point): Perceptron Analysis

Visualizing weights

Perceptron classifiers, and other discriminative methods, are often criticized because the parameters they learn are hard to interpret. To see a demonstration of this issue, we can write a function to find features that are characteristic of one class. (Note that, because of the way perceptrons are trained, it is not as crucial to find odds ratios.)


Fill in findHighWeightFeatures(self, label) in It should return a list of the 100 features with highest weight for that label. You can display the 100 pixels with the largest weights using the command:

python -c perceptron -w  

Use this command to look at the weights, and answer the following question. Which of the following sequence of weights is most representative of the perceptron?


Answer the question in the method q2, returning either 'a' or 'b'.

Question 3 (6 points): Digit Feature Design

Building classifiers is only a small part of getting a good system working for a task. Indeed, the main difference between a good classification system and a bad one is usually not the classifier itself (e.g. perceptron vs. naive Bayes), but rather the quality of the features used. So far, we have used the simplest possible features: the identity of each pixel (being on/off).

To increase your classifier's accuracy further, you will need to extract more useful features from the data. The EnhancedFeatureExtractorDigit in is your new playground. When analyzing your classifiers' results, you should look at some of your errors and look for characteristics of the input that would give the classifier useful information about the label. You can add code to the analysis function in to inspect what your classifier is doing. For instance in the digit data, consider the number of separate, connected regions of white pixels, which varies by digit type. 1, 2, 3, 5, 7 tend to have one contiguous region of white space while the loops in 6, 8, 9 create more. The number of white regions in a 4 depends on the writer. This is an example of a feature that is not directly available to the classifier from the per-pixel information. If your feature extractor adds new features that encode these properties, the classifier will be able exploit them. Note that some features may require non-trivial computation to extract, so write efficient and correct code.

Note: You will be working with digits, so make sure you are using DIGIT_DATUM_WIDTH and DIGIT_DATUM_HEIGHT, instead of FACE_DATUM_WIDTH and FACE_DATUM_HEIGHT.


Add new binary features for the digit dataset in the EnhancedFeatureExtractorDigit function. Note that you can encode a feature which takes 3 values [1,2,3] by using 3 binary features, of which only one is on at the time, to indicate which of the three possibilities you have. In theory, features aren't conditionally independent as naive Bayes requires, but your classifier can still work well in practice. We will test your classifier with the following command:

python -d digits -c naiveBayes -f -a -t 1000  

With the basic features (without the -f option), your optimal choice of smoothing parameter should yield 82% on the validation set with a test performance of 78%. You will receive 3 points for implementing new feature(s) which yield any improvement at all. You will receive 3 additional points if your new feature(s) give you a test performance greater than or equal to 84% with the above command.

Congratulations! You're finished with the CSE 473 programming projects.


Using Canvas, submit your versions of the following files:,,

Updates and corrections:

Any updates or corrections to this project, if needed, will be posted here and in Piazza.