Background - CSE 160

Info

Watch this video by a TA that provides an overview of the assignment.

K-means clustering is a popular machine learning and data mining algorithm that groups data into clusters. Finding these clusters can often reveal interesting and meaningful information about the distribution of data. K-means clustering is used for a number of different applications, such as recognizing hand-written digits and letters, which we will be implementing in this assignment.

K-Means Clustering Overview¶

K-means clustering works in four steps:

Initialize some number ( $k$ ) of cluster centers (centroids).
For each data point in the dataset, assign it to its closest centroid.
Update the location of each centroid to be the average of all the points assigned to that cluster.
Repeat steps 2 and 3 until convergence (i.e. the clusters reach a steady state).

The data points do not change. Only the locations of the centroids change with each iteration. As the centroids move, the set of data points that are closest to each centroid changes.

The following plots show an example of a single run of the k-means clustering for $k = 2$ :

Three scatter plots of the same data. In the first step, two "centroids" are randomly located. Then, each data point is assigned to its closest centroid. Finally, the cluster centroids are updated to be the average of all data points in the cluster.

As with many machine learning techniques, there is a lot of associated terminology. But much of this terminology describes familiar concepts. Below, we’ll explore these concepts in more details:

Distance¶

A simple way we can calculate how close a data point is to a centroid is with Euclidean distance, which is the length of a straight line between two points. We can calculate the Euclidean distance by using the Pythagorean theorem. If we have two points $a = (a_1, a_2, …, a_n)$ and $b = (b_1, b_2, …, b_n)$ (these are coordinates in $n$ -dimensional space), we can define the euclidean distance between both points as:

$D(a, b) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + (a_n - b_n)^2}$

That is, the square root of the sum of the squared differences of the points. For example, if we had two data points from a 3-dimensional dataset x1 = [-2, -1, -4] and x2 = [10, -5, 5], the distance between x1 and x2 can be found by

$D(x1, x2) = \sqrt{(-2 - 10)^2 + (-1 - (-5))^2 + (-4 - 5)^2} = \sqrt{241}$

Clusters¶

A cluster is a collection of points that are part of the same “group”. With k-means, every point is assigned to a single cluster. As the algorithm progresses and the cluster centroids shift, points may change what cluster they are in, even though the point itself does not move. For example, in the picture below, all data points in the blue region are part of cluster1, all data points ub the red region are part of cluster2, and all data points in the green region are part of cluster3.

A scatter plot of data points divided into a blue, red, and green region, each of roughly equal size. The "centroid" for each region (cluster) is the average position of all points in the cluster.

Centroids¶

A centroid is the center of a cluster. When using k-means, we assume that the centroid of a cluster is the average location of all the points in that cluster. This is equivalent to the average of the data points’ components in each dimension. If we have three 3-dimensional points $a$ , $b$ , and $c$ , the average is defined as:

$average = [\frac{a_1 + b_1 + c_1}{3}, \frac{a_2 + b_2 + c_2}{3}, \frac{a_3 + b_3 + c_3}{3}]$

For example, if a cluster contains three 2-dimensional points:

x1 = [1, 1]
x2 = [2, 3]
x3 = [3, 2]

The centroid for this cluster is defined as:

$c = [\frac{1 + 2 + 3}{3}, \frac{1 + 3 + 2}{3}] = [2, 2]$

The following image is a graphical representation of these three points and their centroid:

A graphical grid with the points (1,1), (2,3), and (3,2). The centroid of these three points is at location (2,2), which is the average of all three points' locations.

Convergence¶

We say our algorithm has converged if the locations of all centroids did not change much between two iterations (e.g. within some threshold, such as $1 * 10^{-5}$ or $0.00001$ ). For example, if we have two 2-dimensional centroids $c1$ and $c2$ :

c1 = [0.45132, -0.99134]
c2 = [-0.34135, -2.3525]

After running the next iteration of the algorithm, the centroids are updated to:

c1 = [1.43332, -1.9334]
c2 = [-1.78782, -2.3523]

In this case, the algorithm has not converged, since the location of the two centroids are very different between iterations. However, if the new locations of the centroids are instead:

c1 = [0.45132, -0.99134]
c2 = [-0.34135, -2.3524]

Then we can conclude that the algorithm has converged, since the locations of the two centroids remained very similar between iterations.

K-Means Clustering Algorithm¶

The K-means algorithm (in pseudo code) is described as follows:

data = load data
centroids = randomly created k centroids

while centroids not converged:
    for each data point:
        assign data point to the closest centroid
    for each centroid:
        set new centroid location to be the mean of
        all points in this cluster

The following animation visualizes this process:

A GIF showing the k-means clustering algorithm over 7 steps. Each step, the centroids of the red and blue clusters update to the average of the points in their clusters. While the initial updates significantly change the position of the centroids, by the later steps the centroids to not move much.

The initial frame (step0) is the initialization: two randomly chosen centroids are drawn as an X in blue (centroid 0) and red (centroid 1). In the following steps, (step1 through step6):

Data points are assigned to whichever centroid is closest (and colored accordingly),
Centroid locations are adjusted based on the mean of the points assigned to them.

This process is repeated until the centroids do not move anymore. As we progress through the steps, and especially in the later ones, the centroids do not move as much between steps. Throughout the algorithm, the data points never move but they may be assigned to different centroids as they are updated.

You can also check out Naftali Harris’s Visualizing K-Means Clustering site to play with k-means and get more familiar with it.

With a high-level understanding of the k-means clustering algorithm, you’re ready to start implementing it in Python! Take a look at the assignment specifications for next steps.