Download the starter code for this assignment. Then, extract (unzip) the contents somewhere on your computer (preferably in a dedicated “CSE160” folder and NOT in your Downloads). In VSCode, navigate to File | Open Folder and select the homework4 folder. Inside the homework4 folder you’ll find a number of folders and files:

The folders:

data: a directory containing the data your program needs
results: a directory where your plots will be automatically saved. (Note that this folder will be created automatically when you run the complete kmeans.py at the end of Part 1.)

Python files:

analysis.py: where you’ll do your analysis for part 2
answers.txt: where you’ll write your answers for questions in part 2
kmeans.py: where you’ll write your k-means clustering algorithm for parts 1 and 2
test_kmeans.py: program to test kmeans.py
test_analysis.py: code to test analysis.py
utils.py: helper code for the assignment

Caution

Do not modify anything in utils.py. The file is provided to help you complete your assignment.

Info

If the starter code you downloaded doesn’t have the right structure (except for the results folder), pictured below, try downloading it again by accessing the same site via Incognito mode. Note that the results folder will be created automatically when you run your completed kmeans.py.

Part 1: Implementing K-means clustering¶

In Part 1, you will design and implement a k-means clustering algorithm. You only need to modify kmeans.py for Part 1. Below is an overview of the data structures data and centroids, which you will need to work with in order to implement the algorithm:

data: A list of data points, where each data point is a list of floats.

Example: data = [[0.34, -0.2, 1.13, 4.3], [5.1, -12.6, -7.0, 1.9], [-15.7, 0.06, -7.1, 11.2]] Here, data contains three data points, where each data point is four-dimensional. In this example, point 1 is [0.34, -0.2, 1.13, 4.3], point 2 is [5.1, -12.6, -7.0, 1.9] and point 3 is [-15.7, 0.06, -7.1, 11.2].

centroids: A dictionary of centroids, where the keys are strings (centroid names) and the values are lists (centroid locations).

Example: centroids = {"centroid1": [1.1, 0.2, -3.1, -0.4], "centroid2": [9.3, 6.1, -4.7, 0.18]}. Here, centroids contain two key-value pairs. You should NOT change the names of the centroids when you later update their locations.

Info

When you first run the tests (before you implement anything) you will see an error about missing implementations. As you work your way through the assignment, you’ll start to see fewer errors as the tests pass. Seeing assertion errors for parts that you haven’t implemented yet is normal and expected. You can run your tests at any time by opening the terminal (Terminal |New Terminal) and typing python test_kmeans.py or by just running the test_kmeans.py file in VSCode.

We have provided several test functions that you should use to check your implementation as you go through the assignment. To run the tests, open the terminal (Terminal | New Terminal or control-`) and paste in python test_kmeans.py then press enter/return.

You will only be working in kmeans.py for Part 1. Here is a brief overview of the steps, which are described below:

Implement euclidean_distance(), which calculates the distance between two points.
Implement get_closest_centroid(), which allows you to find the closest centroid for a point.
Implement assign_points_to_centroids(), which sets the centroid assignment for a point.
Implement mean_of_points(), which calculates the mean of a list of points.
Implement update_centroids(), which updates all of the centroids to the new mean of the points assigned to them.
Run the entire algorithm!
Check code quality
Submit your work!

All of the problems in Part 1 are described as function header documentation. Those function headers are copied here and are the same as what’s in the kmeans.py file.

Problem 1: Calculating Euclidean distances¶

Implement euclidean_distance() using the euclidean distance formula to calculate the distance between two points. For example, given

point1 = [1.1, 1, 1, 0.5]
point2 = [4, 3.14, 2, 1]

Then euclidean_distance(point1, point2) should return:

3.7735394525564456

(Your exact precision may vary.)

Likewise, given

point1 = [2, 2]
point2 = [-2, -1]

Then euclidean_distance(point1, point2) should return:

5.0

Refer to the distance section in the Background for more information on how this is calculated.

After writing this function, you should run the tests for kmeans.py by running the following command in the terminal:

python test_kmeans.py

You should see the following output once you have correctly implemented euclidean_distance():

test_euclidean_distance passed.

If you receive an error that looks like

Exception: euclidean_distance() returned None; have you implemented it?

You should check your implementation of euclidean_distance() to make sure it returns a valid value. The exception message will change to reflect the next un-implemented function as you progress through the assignment.

Problem 2: Assigning data points to closest centroids¶

Implement get_closest_centroid() to find the closest centroid for a data point. For example, given

point = [0, 0, 0, 0]
centroids_dict = {"centroid1": [1, 1, 1, 1],
                  "centroid2": [2, 2, 2, 2]}

Then get_closest_centroid(point, centroids_dict) should return

"centroid1"

You should see the following output from test_kmeans.py once you have correctly implemented get_closest_centroid():

test_closest_centroid passed.

As with the Euclidean Distance function, you will also see an error message for the next test, as you have not implemented those functions yet.

Problem 3: Update assignment of points to centroids¶

Implement assign_points_to_centroids() to assign data points to their closest centroid. For example, given a list of points and centroids dictionary:

list_of_points = [[1.1, 1, 1, 0.5], [4, 3.14, 2, 1], [0, 0, 0, 0]]
centroids_dict = {"centroid1": [1, 1, 1, 1],
                  "centroid2": [2, 2, 2, 2]}

Then assign_points_to_centroids(list_of_points, centroids_dict) should return:

{'centroid1': [[1.1, 1, 1, 0.5], [0, 0, 0, 0]],
 'centroid2': [[4, 3.14, 2, 1]]}

If there are no data points assigned to a centroid, then you should not include that centroid in the returned dictionary.

You should see the following output from running test_kmeans.py once you have correctly implemented assign_points_to_centroids():

test_assign_points_to_centroids passed.

Problem 4: Update centroids¶

Info

Do not modify list_of_points or hard-code the dimensionality of the data. Your code should be able to run on data points with any dimension.

Implement mean_of_points(list_of_points) to find the average of a cluster. For example, given

list_of_points = [[1.1, 1, 1, 0.5], [4, 3.14, 2, 1], [0, 0, 0, 0]]

Then mean_of_points(list_of_points) should return:

[1.7, 1.3800000000000001, 1.0, 0.5]

Then, implement the update_centroids(assignment_dict) to update the centroid to be the average of the clusters.

For example, given an assignment dictionary whose keys are centroid names and values are lists of points, say:

assignment_dict = {'centroid1': [[1.1, 1, 1, 0.5], [0, 0, 0, 0]],
                    'centroid2': [[4, 3.14, 2, 1]]}

Then update_centroids(assignment_dict) should return

{'centroid1': [0.55, 0.5, 0.5, 0.25],
 'centroid2': [4.0, 3.14, 2.0, 1.0]}

You should see the following output from test_kmeans.py once you have correctly implemented mean_of_points() and update_centroids():

test_mean_of_points passed.
test_update_centroids passed.

At this point, if you completed all parts to the k-means clustering algorithm you should see following output when you run the tests:

Clustering 2D points and Checking your work¶

Now that you have completed all parts needed to run your k-means clustering algorithm, we can now run it on actual data. At the bottom of kmeans.py you should see code that will do exactly that. This code loads the 2-dimensional data points and the initial centroids we provided in the respective .csv files. After that, the main() function runs the k-means algorithm and write_centroids_tofile() saves the final centroids to a file.

You do not need to modify anything in this block of code for Part 1, but you should make note of its existence as you’ll make a change in Part 2.

if __name__ == "__main__":
    dataset = "2d"

    data, label = read_data("data/" + dataset + ".csv")
    init_c = load_centroids("data/" + dataset + "_init_centroids.csv")
    final_c = main(data, init_c, dataset)
    write_centroids_with_key(dataset + "final_centroids.csv", final_c)

Now run your program by typing python kmeans.py in the terminal window. If your code works as expected, you should see:

K-means converged after 7 steps.

In addition, there should be 7 plot images generated in a newly-created results/2D/ folder. If you examine these 7 plots, they should be identical to the steps shown in the the gif shown in the background section of this spec.

Quality¶

Info

Make sure to provide descriptive comments for each function in docstring format.

Warning

If you discuss this assignment with one or more classmates, you must specify with whom you collaborated in the header comment in your submission. Otherwise, you may be guilty of academic misconduct and face consequences. Note that you may not collaborate in a way that is prohibited, even if you cite the collaboration.

Your assignment should pass two checks: flake8 and our code quality guidelines. The code quality guidelines are very thorough.

Submission¶

Submit kmeans.py to Gradescope for Part 1.