MNIST Dataset¶
Now that you’ve successfully implemented k-means clustering and run it on 2-dimensional data, we will use it to cluster 784-dimensional data points with 12 centroids to identify hand-written letters. We will apply the algorithm to a real-world data set known as the Extended Modified National Institute of Standards and Technology database (EMNIST), which contains images of hand-written letters. For this assignment, we are only using a small subset of the dataset. Here are eight example images:
Each of these images is 28 pixels by 28 pixels, for a total of 784 pixels. Each image can be thought of as a 784-dimensional data point, represented as a list containing 784 numbers. For the images shown above, many of the numbers would be zero since the image is mostly white.
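To make the flattening concrete, here is a minimal sketch (not part of the assignment code) that turns a 28-by-28 image, stored as a list of 28 rows, into a single 784-number data point:

```python
# Sketch: flatten a 28x28 image (a list of 28 rows, each with 28 pixel
# values) into one 784-dimensional data point. An all-white image is
# used here, so every pixel value is zero.
image = [[0] * 28 for _ in range(28)]
point = [pixel for row in image for pixel in row]
print(len(point))  # 784
```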
Problem 0: Running with the MNIST dataset¶
To load the MNIST dataset instead of the 2D points, change the code at the bottom of `kmeans.py` to the following:

```python
if __name__ == "__main__":
    dataset = "mnist"
    data, label = read_data("data/" + dataset + ".csv")
    init_c = load_centroids("data/" + dataset + "_init_centroids.csv")
    final_c = main(data, init_c, dataset)
    write_centroids_with_key(dataset + "_final_centroids.csv", final_c)
```
Info

(Only the variable `dataset` should change, from the string “2d” to “mnist”.)
You should leave the rest of your code unchanged, and it should run on the new data without errors. Run this command again (it may take a minute to run on this data set):

```
python kmeans.py
```

You should see the following output:

```
K-means converged after 13 steps.
```

If you see a different number of steps, or it takes longer than a minute or two, there may be an error in your code.
In the newly-created `results/MNIST` folder, there should be `init` and `final` folders filled with centroids plotted as images. (If you notice that one of the centroids is missing, that’s okay! We’ll talk about this more in the next section.)

In the `init` folder, the centroids should look like random pixels (below, left), because we initially created them with random values. In the `final` folder, they should look roughly like actual letters (below, right) after similar data points (images) have been clustered together.
This shows that K-means has adjusted the locations of the centroids from our random initialization to a more meaningful space that corresponds to where the MNIST data is distributed.
If you were to plot each step of the algorithm between the `init` and `final` stages, you would see a progression from a blurry blob into something that looks like an actual character. (If the animation is stuck on “step26” even after clicking the “Restart Animation” button, you may need to refresh-skipping-cache in your browser to restart the animation: hold Shift and click the refresh button.)

For example, in this animation, the centroid starts as a random distribution of pixels and slowly converges on something approximating an ‘S’. Note that in this case, many of the intermediate steps make no change from one to the next. Since there are 11 other centroids, many of the others are still moving and changing before the algorithm converges.
Load the final centroids: what happened?¶
We have provided some starter code for you near the bottom of `analysis.py`:

```python
if __name__ == "__main__":
    centroids = load_centroids("mnist_final_centroids.csv", with_key=True)
    # Consider exploring the centroids data here
    data, label = read_data("data/mnist.csv")
    print(accuracy(data, label, centroids))
```

Inspect the centroids dictionary by adding print statements into the code after the call to `load_centroids`. You should also look through the `results/MNIST/final` folder and examine the generated images.
How many centroids are there in the final dictionary? In Part 1, we initialized the 2D algorithm with 2 randomly chosen points as the initial centroids, and both remained when the algorithm converged after 7 steps.
But what if we had used different starting centroids? In the 2D dataset from Part 1, there’s nothing particularly special about (0.332..., 0.179...) as a starting centroid. In fact, we could have started with any random point in the 2D plane. The same is true for the starting centroids we gave you for the MNIST data. Each actual data point is 784 pixels, each varying from 0 to 255; organized in a 28x28 grid and rendered into an image, most of them look like a handwritten letter. But if you look at the initial centroids (in `data/mnist_init_centroids.csv`), they only look like random noise: those 784 pixels were randomly generated.

If you look at the data for the initial MNIST centroids, in the `data/mnist_init_centroids.csv` file, you should see that there are 12 lines, each with 784 numbers. Each of these lines represents one of the starting centroids. Inspecting the final centroids dictionary, or counting the number of images in the `results/MNIST/final` folder, you should see that there are only 11 centroids.
Problem 1: What happened to one of the centroids?¶
Leave your answer to this question in the appropriate place in the `answers.txt` file.
Why are there fewer centroids? What do you suspect happened?
Problem 2: Assigning labels to Centroids¶
When we implemented the `assign_points_to_centroids()` function in `kmeans.py`, it assigned each data point to its closest centroid. This works great for our k-means clustering, but now we are interested not only in which centroid each data point is assigned to, but also in the true label of that data point, in order to perform the accuracy analysis.
If you look at the `data/mnist.csv` file, you will see that each data point has a letter as its first field. For example, the first data point is:

```
M,0,0,0,0,...,218,251,254,254,218,79,3,1,22,127,208,...,0,0,0,0,0,0,0
```
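To see how such a row splits into a label and a list of pixel values, here is a minimal sketch; the shortened row is illustrative, and this is not the provided `read_data` code:

```python
# Sketch: split one CSV row into its label (first field) and its pixel
# values. This row is shortened for illustration; real rows have a label
# followed by 784 pixel values.
row = "M,0,0,218,251,254,0"
fields = row.split(",")
label = fields[0]                       # the true label, 'M'
pixels = [int(v) for v in fields[1:]]   # the pixel values as ints
print(label, pixels)
```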
In this data set, each point represents a single handwritten letter. The goal of the k-means algorithm is to group similar data points together, in the hopes that each group represents a single letter. In other words, we want to be able to say that, given a handwritten letter, we can predict what letter it is by looking at the centroid it is closest to. (If all of the Ms are closest to centroid1, then it’s a good guess that a newly discovered handwritten letter that’s also closest to centroid1 is an M, too.)
To help us determine how accurate those groupings are, the MNIST dataset also includes the true label for all of the data points. By assigning labels to the centroids, we can calculate how accurate our groupings are.
In `analysis.py`, you should find a function named `assign_labels()`. This is similar to `assign_points_to_centroids()`, but instead of assigning points to centroids, it assigns the actual labels of the data points. For example, if the points assigned to centroid1 (that is, the points closest to centroid1) have labels ‘M’ and ‘N’, then the labels for centroid1 would similarly be ‘M’ and ‘N’.
```python
def assign_labels(list_of_points, labels, centroids_dict):
    """
    Assign all data points to the closest centroids and keep track of their
    labels. The i-th point in "data" corresponds to the i-th label in "labels".

    Arguments:
        list_of_points: a list of lists representing all data points
        labels: a list of strings representing all data labels
                labels[i] is the label of the point list_of_points[i]
        centroids_dict: a dictionary representing the centroids where the keys
                        are strings (centroid names) and the values are lists
                        of centroid locations

    Returns: a new dictionary whose keys are the centroids' key names and
             values are a list of labels of the data points that are assigned
             to that centroid.

    Example:
        Code:
            list_of_points = [[1.1, 1, 1, 0.5], [4, 3.14, 2, 1], [0, 0, 0, 0]]
            labels = ['M', 'W', 'N']
            centroids_dict = {"centroid1": [1, 1, 1, 1],
                              "centroid2": [2, 2, 2, 2]}
            print(assign_labels(list_of_points, labels, centroids_dict))
        Output:
            {'centroid1': ['M', 'N'], 'centroid2': ['W']}
    """
```
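The core of this function is finding each point’s closest centroid. As one possible sketch, assuming Euclidean distance (as is typical for k-means; the helper name here is illustrative, not from the starter code), that step might look like:

```python
from math import sqrt

def euclidean_distance(a, b):
    # Straight-line distance between two equal-length points.
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

centroids_dict = {"centroid1": [1, 1, 1, 1],
                  "centroid2": [2, 2, 2, 2]}
point = [1.1, 1, 1, 0.5]

# Pick the centroid key whose location is nearest to the point.
closest = min(centroids_dict,
              key=lambda name: euclidean_distance(point, centroids_dict[name]))
print(closest)  # centroid1
```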
Instead of a dictionary whose values are lists of lists, the dictionary returned from `assign_labels` should be a dictionary whose values are lists of strings.

Just like `kmeans.py`, we have provided tests for you to verify your implementation. Run `python test_analysis.py` in the terminal, and you should see the following output after you have correctly implemented `assign_labels()`:

```
test_assign_labels passed.
Exception: majority_count() returned None; have you implemented it yet?
```

As with `kmeans.py` in Part 1, this means you can continue to the next part of the assignment.
Problem 3: Calculate the count of the majority label¶
Implement `majority_count()` to return the count of the majority label. This function should take in a list of labels, determine which label appears most frequently, and return the count of that label.
```python
def majority_count(labels):
    """
    Return the count of the majority label in the label list.

    Arguments:
        labels: a list of labels

    Returns: the count of the majority label in the list

    Example:
        Given labels = ['M', 'M', 'N'],
        majority_count(labels) returns 2
    """
```
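One way to tally label frequencies is Python’s `collections.Counter`; this sketch is one possible approach, not necessarily the intended implementation:

```python
from collections import Counter

labels = ['M', 'M', 'N']
counts = Counter(labels)             # Counter({'M': 2, 'N': 1})
majority = counts.most_common(1)[0]  # the (label, count) pair for the majority
print(majority[1])  # 2
```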
You should see the following output from running `python test_analysis.py` after you have correctly implemented `majority_count()`:

```
test_assign_labels passed
test_majority_count passed
Exception: accuracy() returned None; have you implemented it yet?
```
Analyzing the Performance¶
Info
The images in the MNIST data set have been labeled with the letter they actually represent; you can see this label as the letter in the first column of each row in `mnist.csv`. Each row of that file represents one data point.
One way to evaluate our algorithm would be to examine the accuracy of the classification of the images from MNIST. For example, how many of the images labeled as the letter ‘K’ ended up closest to the same centroid?
First, let’s define the “classifications” of our data points provided by k-means clustering by taking a majority-takes-all approach: we assign each cluster a label equal to the majority label of all data points in that cluster. For example, suppose the data points assigned to centroid1 have these labels: `Q, Q, Q, Q, Q, Q, O, P, Q, Q, O, P, P, Q, Q`.

Since there are 10 Qs, 3 Ps, and 2 Os in this cluster, we say the label of centroid1 is Q, since it is the label that appears most frequently.
We can then define the accuracy of the classification for a centroid as:

accuracy(centroid) = (count of the majority label) / (total number of points assigned to that centroid)

In our example, the majority label is Q, there are 10 Q labels, and there are 15 labels in total. So we say the accuracy for centroid1 is:

10 / 15 ≈ 0.667
To determine the overall accuracy of our algorithm, we determine the majority label for each cluster of points, add those counts together, and then divide by the total number of labels. The accuracy of the algorithm is then defined as:

accuracy = (sum of the majority counts over all clusters) / (total number of data points)
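To make these two definitions concrete, here is the arithmetic from the centroid1 example, plus a second cluster whose numbers are made up purely for illustration:

```python
# Per-centroid accuracy from the example above: 10 Qs out of 15 points.
centroid1_accuracy = 10 / 15
print(round(centroid1_accuracy, 3))  # 0.667

# Overall accuracy: sum the majority counts over all clusters, then
# divide by the total number of points. Suppose a second (hypothetical)
# cluster has 8 points, 6 of which share its majority label.
majority_counts = [10, 6]
cluster_sizes = [15, 8]
overall = sum(majority_counts) / sum(cluster_sizes)
print(round(overall, 3))  # 0.696
```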
Problem 4: Calculate the accuracy of the algorithm¶
Use the two functions you implemented in the previous parts of Part 2 to calculate the accuracy for every cluster and for the whole algorithm, as defined in the Analyzing the Performance section.

In `analysis.py`, use `assign_labels()` and `majority_count()` to implement `accuracy()`, which calculates the accuracy of our k-means clustering algorithm:
```python
def accuracy(list_of_points, labels, centroids_dict):
    """
    Calculate the accuracy of the algorithm. You should use assign_labels
    and majority_count (which you previously implemented).

    Arguments:
        list_of_points: a list of lists representing all data points
        labels: a list of strings representing all data labels
                labels[i] is the label of the point list_of_points[i]
        centroids_dict: a dictionary representing the centroids where the keys
                        are strings (centroid names) and the values are lists
                        of centroid locations

    Returns: a float representing the accuracy of the algorithm
    """
```
You should see the following output from running `test_analysis.py` once you have correctly implemented `accuracy()`:

```
test_accuracy passed.
```

At this point, if you have completed all parts of `analysis.py`, you should also see the following output when you run the tests:

```
all tests passed.
```
Run analysis and think about the results¶
You have now completed all the parts required to run the analysis. The code at the end of the file will print the accuracy of the algorithm for this sample:

```python
if __name__ == "__main__":
    centroids = load_centroids("mnist_final_centroids.csv", with_key=True)
    # Consider exploring the centroids data here
    data, label = read_data("data/mnist.csv")
    print(accuracy(data, label, centroids))
```

Run the program by typing `python analysis.py`.
Note that the file names of the images in the results folder do not necessarily correspond to the label assigned to the character (e.g., centroid0.png is not necessarily an image of ‘0’). Instead, try to figure out the label of each image by looking at the image itself. What is the output? Is this value surprising? It might be helpful to take another look at the final centroid images in the `results/MNIST/final` folder.
Problem 5: Analyze the results¶
Leave your answer in the appropriate place in the `answers.txt` file.
In your opinion, based on these centroids, which letters are easier for the algorithm to distinguish, and which are harder? Reasonable answers (2-3 sentences) will receive full credit.
You may find it helpful to add some print statements to the `accuracy` function. Although we only ask for the overall accuracy of the algorithm, it can be illustrative to print the accuracy for each centroid (refer back to the Analyzing the Performance section, which discusses calculating the accuracy for each centroid). You may additionally find it helpful to (temporarily) modify the call to `majority_count` to print out the majority label for each centroid.
Caution
Before submitting your code to Gradescope, make sure to remove any print statements you added.
Quality¶
Info
Make sure to provide descriptive comments for each function in docstring format.
Warning
If you discuss this assignment with one or more classmates, you must specify with whom you collaborated in the header comment in your submission. Otherwise, you may be guilty of academic misconduct and face consequences. Note that you may not collaborate in a way that is prohibited, even if you cite the collaboration.
Your assignment should pass two checks: `flake8` and our code quality guidelines. The code quality guidelines are very thorough.
Submission¶
Submit `analysis.py`, `kmeans.py`, and `answers.txt` via Gradescope for Part 2.