Inside your homework4 folder you’ll find a number of folders and files. Here is a quick breakdown of each of them:

folders:

  • data: a directory containing the data your program needs
  • results: a directory where your plots will be automatically saved. (This folder will be created automatically when you run the complete kmeans.py file at the end of Part 1.)

files:

  • kmeans.py: where you’ll write your k-means clustering algorithm for part 1
  • analysis.py: where you’ll do your analysis for part 2
  • answers.txt: where you’ll write your answers for questions in part 2
  • test_kmeans.py: a program to test kmeans.py
  • test_analysis.py: code to test analysis.py
  • utils.py: helper code for the assignment

Caution

Do not modify anything in utils.py. The file is provided to help you complete your assignment.



Part 1: Implementing the K-Means Algorithm

In Part 1, you will design and implement the core k-means clustering algorithm.

For this part, you only need to modify the kmeans.py file.

Data Structures Overview

Below is an overview of the data structures data and centroids, which you will need to work with in order to implement the algorithm:

data: A list of data points, where each data point is a list of floats. For example:

data = [[0.34, -0.2, 1.13, 4.3], 
        [5.1, -12.6, -7.0, 1.9], 
        [-15.7, 0.06, -7.1, 11.2]]

Here, data contains three data points. Each data point is four-dimensional. In this example, point 1 is [0.34, -0.2, 1.13, 4.3], point 2 is [5.1, -12.6, -7.0, 1.9] and point 3 is [-15.7, 0.06, -7.1, 11.2].

centroids: A dictionary of centroids, where the keys are strings (centroid names) and the values are lists (centroid locations). For example:

centroids = {
    "centroid1": [1.1, 0.2, -3.1, -0.4], 
    "centroid2": [9.3, 6.1, -4.7, 0.18]
}

Here, centroids contains two key-value pairs. You should NOT change the names of the centroids when you later update their locations.

Info

When you first run the tests (before you implement anything) you will see an error about missing implementations. As you work your way through the assignment, you’ll start to see fewer errors as the tests pass. Seeing assertion errors for parts that you haven’t implemented yet is normal and expected.

We have provided several test functions that you should use to check your implementation as you go through the assignment. To run the tests, open the terminal and run the command:

python test_kmeans.py

You will only be working in kmeans.py for Part 1. Here is a brief overview of each step in Part 1:

  1. Implement euclidean_distance(), which calculates the distance between two points.
  2. Implement get_closest_centroid(), which allows you to find the closest centroid for a point.
  3. Implement assign_points_to_centroids(), which assigns every point to its closest centroid.
  4. Implement mean_of_points(), which calculates the mean of a list of points, and update_centroids(), which updates all of the centroids to the new mean of the points assigned to them.
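Taken together, these four steps form the k-means loop, which the provided main() runs for you. As a rough, self-contained sketch of the idea only (the inline helpers, convergence check, and iteration cap below are illustrative assumptions, not the starter code's exact logic):

```python
import math

def kmeans_sketch(data, centroids, max_iterations=100):
    """Illustrative k-means loop; the provided main() may differ."""
    def distance(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    for _ in range(max_iterations):
        # Steps 2-3: group every point under the name of its closest centroid
        clusters = {}
        for point in data:
            name = min(centroids, key=lambda c: distance(point, centroids[c]))
            clusters.setdefault(name, []).append(point)
        # Step 4: move each centroid to the mean of the points assigned to it
        new_centroids = {
            name: [sum(dim) / len(points) for dim in zip(*points)]
            for name, points in clusters.items()
        }
        if new_centroids == centroids:  # no centroid moved: converged
            return new_centroids
        centroids = new_centroids
    return centroids
```

You will build each piece of this loop as its own function in the problems below.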

Problem 1: Calculating Euclidean Distance

Implement euclidean_distance() using the Euclidean distance formula to calculate the distance between two points. For example, given

point1 = [1.1, 1, 1, 0.5]
point2 = [4, 3.14, 2, 1]

Then euclidean_distance(point1, point2) should return:

3.7735394525564456

(Your exact precision may vary.)

Likewise, given

point1 = [2, 2]
point2 = [-2, -1]

Then euclidean_distance(point1, point2) should return:

5.0

Refer to the distance section in the Background for more information on how Euclidean distance is calculated.
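The formula translates almost directly into Python. One possible sketch (your version in kmeans.py may be structured differently, as long as it matches the starter code's signature):

```python
import math

def euclidean_distance(point1, point2):
    """Distance between two points of any (equal) dimension."""
    # sum the squared differences along each dimension, then take the root
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(point1, point2)))
```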

After implementing this function, you should run the tests for kmeans.py by running the following command in the terminal:

python test_kmeans.py

You should see the following output once you have correctly implemented euclidean_distance():

test_euclidean_distance passed.

If you receive an error that looks like

Exception: euclidean_distance() returned None; have you implemented it?

You should check your implementation of euclidean_distance() to make sure it returns a valid value. The exception message will change to reflect the next un-implemented function as you progress through the assignment.

Problem 2: Finding the Closest Centroid

Implement get_closest_centroid() to find the closest centroid for a data point. For example, given

point = [0, 0, 0, 0]
centroids_dict = {"centroid1": [1, 1, 1, 1],
                  "centroid2": [2, 2, 2, 2]}

Then get_closest_centroid(point, centroids_dict) should return

"centroid1"

In order to find the closest centroid, you’ll have to compare the distance from point to every centroid in centroids_dict.

What Python concept is useful here?

You should see the following output from test_kmeans.py once you have correctly implemented get_closest_centroid():

test_get_closest_centroid passed.

As with the Euclidean distance function, you will also see an error message for the next test, as you have not implemented those functions yet.
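If you get stuck, here is one possible shape for this function (a sketch, not the required implementation — in your kmeans.py you would call your own euclidean_distance() from Problem 1 rather than redefining it):

```python
import math

def euclidean_distance(point1, point2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(point1, point2)))

def get_closest_centroid(point, centroids_dict):
    """Return the name of the centroid nearest to point."""
    # min() compares centroid names by their distance to the point
    return min(centroids_dict,
               key=lambda name: euclidean_distance(point, centroids_dict[name]))
```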

Problem 3: Assigning Points to Centroids

Implement assign_points_to_centroids() to assign data points to their closest centroid. For example, given a list of points and centroids dictionary:

list_of_points = [[1.1, 1, 1, 0.5], 
                  [4, 3.14, 2, 1], 
                  [0, 0, 0, 0]]
centroids_dict = {"centroid1": [1, 1, 1, 1],
                  "centroid2": [2, 2, 2, 2]}

Then assign_points_to_centroids(list_of_points, centroids_dict) should return:

{'centroid1': [[1.1, 1, 1, 0.5], [0, 0, 0, 0]],
 'centroid2': [[4, 3.14, 2, 1]]}

If there are no data points assigned to a centroid, then you should not include that centroid in the returned dictionary.
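One way to satisfy that rule is to only create a dictionary entry when a point is actually assigned. A sketch (again with the Problem 1-2 helpers inlined for illustration; your version should reuse your own functions):

```python
import math

def euclidean_distance(point1, point2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(point1, point2)))

def get_closest_centroid(point, centroids_dict):
    return min(centroids_dict,
               key=lambda name: euclidean_distance(point, centroids_dict[name]))

def assign_points_to_centroids(list_of_points, centroids_dict):
    """Group each point under the name of its closest centroid."""
    assignments = {}
    for point in list_of_points:
        name = get_closest_centroid(point, centroids_dict)
        # setdefault only creates a key when a point is assigned, so
        # centroids with no assigned points never appear in the result
        assignments.setdefault(name, []).append(point)
    return assignments
```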

You should see the following output from running test_kmeans.py once you have correctly implemented assign_points_to_centroids():

test_assign_points_to_centroids passed.

Problem 4: Updating Centroids

Do not modify list_of_points or hard-code the dimensionality of the data. Your code should be able to run on data points with any dimension.

Implement mean_of_points(list_of_points) to find the average of a cluster. For example, given

list_of_points = [[1.1, 1, 1, 0.5], 
                  [4, 3.14, 2, 1], 
                  [0, 0, 0, 0]]

Then mean_of_points(list_of_points) should return:

[1.7, 1.3800000000000001, 1.0, 0.5]

Then, implement update_centroids(assignment_dict) to update each centroid to be the average of the points in its cluster.

For example, given an assignment dictionary whose keys are centroid names and values are lists of points, say:

assignment_dict = {
    'centroid1': [[1.1, 1, 1, 0.5], [0, 0, 0, 0]],
    'centroid2': [[4, 3.14, 2, 1]]
}

Then update_centroids(assignment_dict) should return

{
    'centroid1': [0.55, 0.5, 0.5, 0.25],
    'centroid2': [4.0, 3.14, 2.0, 1.0]
}
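As a hint at the shape of these two functions, here is one dimension-agnostic sketch (zip(*...) is one way to regroup values by dimension; your implementation may differ):

```python
def mean_of_points(list_of_points):
    """Dimension-wise average of a non-empty list of points."""
    # zip(*points) pairs up the values of each dimension across all points,
    # so this works for points of any dimensionality
    return [sum(dim) / len(list_of_points) for dim in zip(*list_of_points)]

def update_centroids(assignment_dict):
    """Replace each centroid location with the mean of its assigned points."""
    return {name: mean_of_points(points)
            for name, points in assignment_dict.items()}
```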

You should see the following output from test_kmeans.py once you have correctly implemented mean_of_points() and update_centroids():

test_mean_of_points passed.
test_update_centroids passed.

At this point, if you have completed all parts of the k-means clustering algorithm, you should see the following output when you run the test_kmeans.py tests:

test_euclidean_distance passed.
test_get_closest_centroid passed.
test_assign_points_to_centroids passed.
test_mean_of_points passed.
test_update_centroids passed.
all tests passed.

Running Your Algorithm on 2D Data

Now that you have completed all parts of the k-means clustering algorithm, you can run it on actual data. At the bottom of kmeans.py you should see code that will do exactly that. This code loads the 2-dimensional data points and the initial centroids we provided in the respective .csv files. After that, the main() function runs the k-means algorithm and write_centroids_with_key() saves the final centroids to a file.

Note that if you wish to edit the contents of the .csv files in the data folder for testing purposes, you can right-click the file in your JupyterHub file browser and select “Open With” > “Editor”. Be sure to revert any changes to the contents of these files before running any provided tests!

You do not need to modify anything in this block of code for Part 1, but you should make note of its existence as you’ll make a change in Part 2.

if __name__ == "__main__":
    dataset = "2d"

    data, label = read_data("data/" + dataset + ".csv")
    init_c = load_centroids("data/" + dataset + "_init_centroids.csv")
    final_c = main(data, init_c, dataset)
    write_centroids_with_key(dataset + "_final_centroids.csv", final_c)

Now run your program by typing python kmeans.py in the terminal. If your code works as expected, you should see:

K-means converged after 7 steps.

In addition, there should be 7 plot images generated in a newly-created results/2D/ folder. If you examine these 7 plots, they should be identical to the steps shown in the gif in the background section of this spec.

Clustering Handwritten Letters

Now that you’ve successfully implemented k-means clustering and run it on 2-dimensional data, we will use your algorithm to cluster 784-dimensional data points with 12 centroids to identify handwritten letters. We will apply the algorithm to a real-world data set known as the Extended Modified National Institute of Standards and Technology database (EMNIST), which contains images of handwritten letters. For this assignment, we are only using a small subset of the dataset. Here are eight example images:

Examples of 28x28 pixel grids for the handwritten letters W, G, P, O, w, q, m, and k.

Each of these images is 28 pixels by 28 pixels, for a total of 784 pixels. Each image can be thought of as a 784-dimensional data point, represented as a list containing 784 numbers. For the images shown above, many of the numbers would be zero since the image is mostly white.

To load the MNIST dataset instead of the 2D points, change the code at the bottom of kmeans.py to the following:

if __name__ == "__main__":
    dataset = "mnist"

    data, label = read_data("data/" + dataset + ".csv")
    init_c = load_centroids("data/" + dataset + "_init_centroids.csv")
    final_c = main(data, init_c, dataset)
    write_centroids_with_key(dataset + "_final_centroids.csv", final_c)

Only the variable dataset should change, from the string "2d" to "mnist".

You should leave the rest of your code unchanged, and it should run on the new data without errors. Run the command again (it may take a minute on this data set):

python kmeans.py

You should see the following output:

K-means converged after 13 steps.

If you see a different number of steps, or it takes longer than a minute or two, there may be an error in your code.

In the newly-created results/MNIST folder, there should be init and final folders filled with centroids plotted as images. (If you notice that one of the centroids is missing, that’s okay! We’ll talk about this more in the next section.)

In the init folder, the centroids should look like random pixels (below left), because we initially created them with random values. In the final folder, they should look roughly like actual letters (below right) after similar data points (images) have been clustered together.

Two pixel grid images showing the composite "Centroid 0". The grid on the left shows random noise, as the centroid has been randomly initialized. The grid on the right shows a blurry outline of the letter "S" after k-means algorithm has finished.

This shows that k-means has adjusted the locations of the centroids from our random initialization to a more meaningful space that corresponds to where the MNIST data is distributed.

If you were to plot each step of the algorithm between the init and final stages, you would see a progression from a blurry blob into something that looks like an actual character.

An animation of a pixel grid representing a centroid transitioning from random noise in step 0 to an outline of the letter "S" by step 26.

For example, in this animation, the centroid begins as a random distribution of pixels and slowly converges on something approximating an ‘S’. Note that in this case, many of the intermediate steps make no change from one step to the next. Since there are 11 other centroids, many of the others are still moving and changing before the algorithm converges.

Part 2: Analyzing Clustering Results

In this part, you will analyze the results produced by your k-means implementation.

For this part, you’ll work in the analysis.py and answers.txt files.

You should not modify kmeans.py in Part 2.

We have provided some starter code for you near the bottom of analysis.py:

if __name__ == "__main__":
    centroids = load_centroids("mnist_final_centroids.csv", with_key=True)
    # Consider exploring the centroids data here

    data, label = read_data("data/mnist.csv")
    print(accuracy(data, label, centroids))

Inspect the generated centroids dictionary by adding print statements into the code after the call to load_centroids(). You should also look through the results/MNIST/final folder at the generated images.

How many centroids are there in the final dictionary? In Part 1, we initialized the 2D algorithm with 2 randomly chosen points as the initial centroids, and both remained when the algorithm converged after 7 steps.

But what if we had used different starting centroids? In the 2D dataset from Part 1, there’s nothing particularly special about (0.332..., 0.179...) as a starting centroid. In fact, we could have started with any random point in 2D space. The same is true of the starting centroids we gave you for the MNIST data. Each actual data point is 784 pixels, each varying in value from 0 to 255. Organized in a 28x28 grid and rendered as an image, most of them look like a handwritten letter. But if you look at the initial centroids (in data/mnist_init_centroids.csv), they only look like random noise: those 784 pixels were randomly generated.

If you look at the data for the initial MNIST centroids, in the data/mnist_init_centroids.csv file, you should see that there are 12 lines, each with 784 numbers. Each of these lines represents one of the starting centroids. Inspecting the final centroids dictionary, or counting the number of images in the results/MNIST/final folder, you should see that there are only 11 centroids.

Problem 5: What Happened to One of the Centroids?

Leave your answer to this question in the appropriate place in the answers.txt file.

Why are there fewer centroids? What do you suspect happened?

Problem 6: Assigning Labels to Centroids

When we implemented the assign_points_to_centroids() function in kmeans.py, it assigned each data point to the centroid it is closest to. This works great for k-means clustering, but for the accuracy analysis we are interested not only in which centroid each data point is assigned to, but also in the true label of that data point.

If you look at the data/mnist.csv file, you will see that each data point has a letter as the first character. For example, the first data point is:

M,0,0,0,0,...,218,251,254,254,218,79,3,1,22,127,208,...,0,0,0,0,0,0,0

In this data set, each point represents a single handwritten letter. The goal of the k-means algorithm is to group similar data points together, in the hopes that each group represents a single letter. In other words, we want to be able to say that, given a handwritten letter, we can predict what letter it is by looking at the centroid it is closest to. (If all of the Ms are closest to centroid1, then it’s a good guess that a newly discovered handwritten letter that’s also closest to centroid1 is an M, too.)

To help us determine how accurate those groupings are, the MNIST dataset also includes the true label for all of the data points. By assigning labels to the centroids, we can calculate how accurate our groupings are.

In analysis.py, you will find a function named assign_labels(). This function is similar to assign_points_to_centroids(), but instead of assigning points to centroids, it assigns the points’ actual labels. For example, if the points assigned to centroid1 (i.e., the points closest to centroid1) have the true labels ‘M’ and ‘N’, then the labels for centroid1 would be ‘M’ and ‘N’.

For example, given

list_of_points = [[1.1, 1, 1, 0.5], [4, 3.14, 2, 1], [0, 0, 0, 0]]
labels = ['M', 'W', 'N']
centroids_dict = {"centroid1": [1, 1, 1, 1],
                  "centroid2": [2, 2, 2, 2]}

Then calling assign_labels(list_of_points, labels, centroids_dict) would return

{'centroid1': ['M', 'N'], 'centroid2': ['W']}

Instead of a dictionary whose values are lists of lists, the dictionary returned from assign_labels should be a dictionary whose values are lists of strings.
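Since each label sits at the same index as its point, pairing them up with zip() is one natural approach. A self-contained sketch (the inline distance helper is illustrative only; in analysis.py you would reuse the functions imported from kmeans.py):

```python
import math

def assign_labels(list_of_points, labels, centroids_dict):
    """Group each point's label under the point's closest centroid."""
    def distance(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    label_assignments = {}
    # zip pairs each point with the label at the same index
    for point, label in zip(list_of_points, labels):
        name = min(centroids_dict,
                   key=lambda c: distance(point, centroids_dict[c]))
        label_assignments.setdefault(name, []).append(label)
    return label_assignments
```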

Just like kmeans.py, we have provided tests for you to verify your implementation. To run these tests, run the following command in the terminal:

python test_analysis.py

You should see the following output after you have correctly implemented assign_labels():

test_assign_labels passed.
Exception: majority_count() returned None; have you implemented it yet?

As with kmeans.py in Part 1, this means you can continue to the next part of the assignment.

Problem 7: Majority Label Count

Implement majority_count() to return the count of the majority label. This function should take in a list of labels, determine which label appears most frequently, and return the count of that label.

For example, given:

labels = ['M', 'M', 'N']

Then calling majority_count(labels) should return 2.
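One compact way to do this uses collections.Counter from the standard library (a sketch; a hand-written counting loop with a dictionary works just as well, and may be what the course’s style guidelines expect):

```python
from collections import Counter

def majority_count(labels):
    """Return the count of the most frequent label in the list."""
    # most_common(1) returns [(label, count)] for the top label
    return Counter(labels).most_common(1)[0][1]
```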

You should see the following output from python test_analysis.py after you have correctly implemented majority_count():

test_assign_labels passed.
test_majority_count passed.
Exception: accuracy() returned None; have you implemented it yet?

Problem 8: Compute the Accuracy of the Algorithm

One way to evaluate our algorithm would be to examine the accuracy of the classification of the images from MNIST. For example, how many of the images labeled as the letter ‘K’ ended up closest to the same centroid?

The images in the MNIST data set have been labeled with the letter they actually represent — you can see this label as the letter in the first column on each row in mnist.csv. Each row of that file represents one data point.

First, let’s define the “classifications” of our data points provided by k-means clustering by taking a majority-takes-all approach: we assign each cluster the label corresponding to the (relative) majority of the data points in the cluster. For example, suppose we have data points with these labels assigned to centroid1:

Q, Q, Q, Q, Q, Q, O, P, Q, Q, O, P, P, Q, Q

Since there are 10 Qs, 3 Ps, and 2 Os in this cluster, we say the label of centroid1 is Q since it is the label that appears the most frequently.

We can then define the accuracy of the classification for a centroid as:

\text{accuracy} = \frac{\text{number of majority labels}}{\text{number of labels}}

In our example, the majority label is Q, there are 10 label Q’s, and there are 15 labels in total. So we say the accuracy for centroid1 is:

\text{accuracy} = \frac{10}{15} \approx 0.667

To determine the overall accuracy of our algorithm, we’ll determine the majority label for each cluster of points, add those counts together, and then divide that by the total number of labels.

The accuracy of the algorithm is then defined as:

\text{accuracy} = \frac{\sum_{n=1}^{\texttt{len(centroids)}} \text{majority label count for centroid } n}{\text{total number of labels}}

In other words, our measure of accuracy is the percentage of data points which are in the majority in their own cluster.


In analysis.py, use assign_labels() (Problem 6) and majority_count() (Problem 7) to implement the accuracy() function, which calculates the accuracy of our k-means clustering algorithm according to the above equation:

def accuracy(list_of_points, labels, centroids_dict):
    """
    ...

    Returns: a float representing the accuracy of the algorithm
    """

You should see the following output from running test_analysis.py once you have correctly implemented accuracy():

test_accuracy passed.

At this point, if you have completed all parts of analysis.py, you should also see the following output when you run the tests:

all tests passed.

Problem 9: Analyze the Results

You have now completed all the parts required to run the analysis!

The code at the end of the file will help to print the accuracy of the algorithm for this sample:

if __name__ == "__main__":
    centroids = load_centroids("mnist_final_centroids.csv", with_key=True)
    # Consider exploring the centroids data here

    data, label = read_data("data/mnist.csv")
    print(accuracy(data, label, centroids))

Run the program with the command

python analysis.py

Note that the file names of the images in the results folder do not necessarily correspond to the labels assigned to the characters (e.g., centroid0.png is not necessarily an image of ‘0’). Instead, try to figure out the label of each image by looking at the image itself. What is the output? Is this value surprising? It might be helpful to take another look at the final centroid images in the results/MNIST/final folder.

In your opinion, based on these centroids, which letters are easier for the algorithm to distinguish and which are harder?

Leave your answer in the appropriate place in the answers.txt file.

You may find it helpful to add some print statements to the accuracy() function. Although we ask for the accuracy of the algorithm overall, it might be illustrative to print the accuracy for each centroid (refer back to Problem 8 for details). You may also find it helpful to (temporarily) modify the call to majority_count() to print out the majority label for each centroid.

If you add any print statements for debugging, remove them before submitting to Gradescope.


Code Quality

Info

Make sure to provide descriptive comments for each function in docstring format.

Your assignment should pass two checks: flake8 and our code quality guidelines. The code quality guidelines are very thorough.

Collaboration

Warning

If you discuss an assignment with one or more classmates, you must specify with whom you collaborated in the header comment in your submission. You may discuss with as many classmates as you like, but you must cite all of them in your work. Note that you may not collaborate in a way that is prohibited, even if you cite the collaboration. Please consult the course syllabus for more details.

Submission

Submit your completed kmeans.py, analysis.py, and answers.txt files to Gradescope.

Your files must be named kmeans.py, analysis.py, and answers.txt in order to be graded correctly. Please comment out any assert or print statements before submitting to Gradescope.