Week 6 Monday#

Announcements#

  • I have office hours after class at 11am, next door in ALP 3610.

  • Videos and video quizzes posted; due Friday before lecture.

  • Midterms should be available on Gradescope Tuesday.

  • Plan for today: Discussion of the K-means algorithm, followed by time to work on Worksheet 9.

The K-means algorithm#

Here we describe the K-means algorithm. On the worksheet, you will use a KMeans object from scikit-learn, which will run this algorithm automatically.

We choose the number of clusters we want to search for, n_clusters. For this example, let’s say n_clusters=3 and let’s say there are 500 points in the dataset.

  1. Randomly choose 3 points, called centroids.

  2. For each of the 500 points in the dataset, assign it to the nearest centroid. We have now divided the data into 3 clusters.

  3. Compute the centroid (also called the average or the mean) of each of these 3 clusters. These 3 centroids are what is meant by the phrase K-means (in this case, K = 3).

  4. For each of the 500 points in the dataset, assign it to the nearest centroid. Continue repeating this process (assign to a cluster, compute the centroid of each cluster) until the process terminates. (By terminates, we mean each point remains in the same cluster.)
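To see the whole loop in one place, here is a rough sketch of the algorithm in plain NumPy. This is only an illustration; the function name and details below are my own choices, not how scikit-learn implements KMeans.

import numpy as np

def kmeans_sketch(X, n_clusters=3, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose n_clusters rows of X at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=n_clusters, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Steps 2 and 4: assign each point to its nearest centroid,
        # comparing squared distances (no square root needed).
        dists = ((X[:, np.newaxis, :] - centroids) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        # Terminate once every point stays in the same cluster.
        if labels is not None and (new_labels == labels).all():
            break
        labels = new_labels
        # Step 3: recompute the centroid (mean) of each cluster.
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(n_clusters)])
    return labels, centroids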

import numpy as np
import pandas as pd
import altair as alt

Making an artificial dataset for clustering#

from sklearn.datasets import make_blobs

The make_blobs function returns two things, but we will only need the first one, which we assign to the variable X. A convention in Python is to use underscore _ for a variable name that won’t be used.

For the keyword arguments to make_blobs:

  • Think of n_samples as the number of points (also the number of rows in the NumPy array).

  • Think of n_features as the number of dimensions (also the number of columns).

  • The argument centers refers to how many blobs will be made.

  • Setting random_state=1 makes the random numbers reproducible. (If you run this code with the same parameters, especially the same random_state, you should get exactly the same numbers in X.)

# Notice centers does not equal n_clusters
X, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)
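As an aside, the second output of make_blobs (the one we discarded with the underscore) holds the "true" blob label of each point. If you were curious, you could keep it; the variable name y_true below is just my own choice.

# Same call as above, but keeping the second output this time.
X, y_true = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)
y_true.shape   # (500,) -- one integer label (0 through 3) per point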

This X is a NumPy array.

type(X)
numpy.ndarray

It has 500 rows and 2 columns. Today we want to plot the points, so using 2 columns makes the most sense. In the worksheet today, we will be working with higher-dimensional data.

X.shape
(500, 2)

NumPy arrays share many similarities with pandas DataFrames, but the rows and columns are not labeled in a NumPy array. For example, there is no attribute X.columns.

X.columns
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In [8], line 1
----> 1 X.columns

AttributeError: 'numpy.ndarray' object has no attribute 'columns'

Here are the first 5 rows in X.

X[:5]
array([[ -6.92324165, -10.66953197],
       [ -8.63062033,  -7.13940564],
       [ -9.63048069,  -2.72044935],
       [ -2.30647659,   5.30797676],
       [ -7.57005366,  -3.01446491]])

The first iterations by hand#

Let’s use the following three points for our initial centroids.

[[-4, -10], [-4, -8], [-10, -10]]

Because we called the make_blobs function ourselves using centers=4, we secretly know there are 4 clusters. But in real-world scenarios, we probably won’t know how many clusters there are. To emphasize that, we will try to divide the data into 3 clusters.

n_clusters=3

The first step of the K-means algorithm is to choose n_clusters random centroids (they don't actually have to be the centers of anything). Here are our guesses; any other numbers would have made just as much sense.

centroids = np.array([[-4, -10], [-4, -8], [-10, -10]])
centroids.shape
(3, 2)

So that we can plot using Altair, we will put our random data from make_blobs into a DataFrame.

df_data = pd.DataFrame(X, columns=["x", "y"])

We do the same thing for our centroids data. This is a much smaller DataFrame (just 3 rows because we are looking for 3 clusters).

df_centroids = pd.DataFrame(centroids, columns=["x","y"])

Notice how the data does seem to lie in 4 clusters, as expected.

c1 = alt.Chart(df_data).mark_circle().encode(
    x="x",
    y="y"
)

c1

Here are our initial guesses for the centers. We add some customization to make the points more visible. (Notice how we are using mark_point instead of the usual mark_circle.)

c2 = alt.Chart(df_centroids).mark_point(
    size=500,
    shape='cross',
    filled=True,
    stroke="black",
).encode(
    x="x",
    y="y",
)

c2

Here we see the two charts side-by-side.

alt.hconcat(c1,c2)

More helpful is to have the charts “layered” one on top of the other.

alt.layer(c1,c2)

We haven't really started the K-means clustering algorithm yet. All we've done is choose our initial guesses for the centroids. The next step is to assign each point to its nearest centroid. Let's try that first with a single point, the row of X at index 20, which we can access using X[20]. (This would not work if X were a DataFrame, because pandas would then look for a column labeled with the integer 20.)

z = X[20]
z
array([-10.4768696 ,  -3.60318139])
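(If we did want the analogous row from the DataFrame version, we would use positional indexing.)

df_data.iloc[20]   # same two numbers as X[20], returned as a pandas Series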

As a reminder, these are our centroids. Notice how our point z is closest to the last of these three points, centroids[2].

centroids
array([[ -4, -10],
       [ -4,  -8],
       [-10, -10]])

It's not obvious that subtracting a length-2 one-dimensional NumPy array from a 3-by-2 two-dimensional NumPy array should make sense, but it does, because of NumPy's broadcasting rules. (Those rules are tricky to wrap your head around at first; for now, just notice that the following works.) For example, the 0.47... in the lower-left entry corresponds to (-10) - (-10.47...).

# trick uses what's called "broadcasting" in NumPy
centroids - z
array([[ 6.4768696 , -6.39681861],
       [ 6.4768696 , -4.39681861],
       [ 0.4768696 , -6.39681861]])
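If it helps, here is the same broadcasting rule on a tiny made-up example: subtracting a length-2 array from a 3-by-2 array subtracts it from every row.

small = np.array([[1, 2], [3, 4], [5, 6]])
small - np.array([10, 20])
array([[ -9, -18],
       [ -7, -16],
       [ -5, -14]])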

We now square each of these terms individually. (Notice how squaring a NumPy array automatically performs this squaring operation elementwise.)

(centroids - z)**2
array([[41.94983985, 40.91928835],
       [41.94983985, 19.3320139 ],
       [ 0.22740462, 40.91928835]])

There's nothing special about the ** notation; it's just the (admittedly strange-looking) usual way to raise a number to a power in Python.

6**2
36

We next add up the squared differences along each row. Take a minute and convince yourself that axis=1 is the correct keyword argument to use in this case.

((centroids - z)**2).sum(axis=1)
array([82.86912819, 61.28185375, 41.14669297])
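(If the axis keyword is confusing: axis=0 would sum down each column, giving one number per column, while axis=1 sums across each row, giving one number per centroid, which is what we want. A tiny made-up example:)

arr = np.array([[1, 2], [3, 4], [5, 6]])
arr.sum(axis=0)   # array([ 9, 12]) -- one total per column
arr.sum(axis=1)   # array([ 3,  7, 11]) -- one total per row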

If we really wanted the distance between z and each point in centroids, we would now take the square root of each element. But all we care about is which value is smallest, so there is no need to take that square root, and skipping it saves a little computation.

When we call argmin, we get as output 2, for this particular choice of z. This corresponds to what we said above, that z is closest to centroids[2]. (I believe there is no idxmin in NumPy, because NumPy arrays don’t have labels.)

((centroids - z)**2).sum(axis=1).argmin()
2
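(As a sanity check that skipping the square root does not change the answer: the square root is an increasing function, so applying it first still picks out the same index.)

np.sqrt(((centroids - z)**2).sum(axis=1)).argmin()
2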

Let’s turn that procedure we just did into a function.

def closest_centroid_index(z, centroids):
    return ((centroids - z)**2).sum(axis=1).argmin()

Here we check that the function is working as expected.

closest_centroid_index(z, centroids)
2

That function was applied to a single row, the row X[20]. We want to apply it to every row. (Take a minute to convince yourself that axis=1 is the keyword argument to use.) We can do that using apply. It would be nice if we could just use

df_data.apply(closest_centroid_index, axis=1)

but because closest_centroid_index takes two input arguments, we won’t be able to use this directly.

I had expected

df_data.apply(lambda row: closest_centroid_index(row, centroids), axis=1)

to work, but there is a subtle difference between z, which we used above, and the row being passed here. I will have to think more about what goes wrong in that case. To get around it, we use row.values, which converts the pandas Series row into a NumPy array. Since z above was a NumPy array, this is a natural choice.

The following values tell us which of the three centroids is closest to each point (i.e., to each row).

df_data.apply(lambda row: closest_centroid_index(row.values, centroids), axis=1)
0      0
1      2
2      2
3      1
4      1
      ..
495    1
496    1
497    2
498    2
499    1
Length: 500, dtype: int64

Here is a reminder of what centroids looked like.

centroids
array([[ -4, -10],
       [ -4,  -8],
       [-10, -10]])

And a reminder of what df_data looked like.

df_data
x y
0 -6.923242 -10.669532
1 -8.630620 -7.139406
2 -9.630481 -2.720449
3 -2.306477 5.307977
4 -7.570054 -3.014465
... ... ...
495 -7.827495 -2.510321
496 -6.380881 -8.506638
497 -8.960149 -8.063499
498 -7.666039 -7.597155
499 -6.465344 -2.855446

500 rows × 2 columns

Let’s store these values in a new column in df_data.

df_data["first cluster"] = df_data.apply(lambda row: closest_centroid_index(row.values, centroids), axis=1)

Here is the resulting chart, but it will be more meaningful below when we add in the centroids.

c1 = alt.Chart(df_data).mark_circle().encode(
    x="x",
    y="y",
    color="first cluster:N"
)

c1

Here is a shorthand for layering the charts together; instead of using alt.layer(c1, c2), we use c1+c2.

For example, notice how there are just a few points, the blue points, which are closest to the lower-right centroid.

c1+c2

This already took most of class. The next step would be to compute the centroids of these clusters and then repeat the above procedure using the new centroids. We continue repeating this two-step process (compute centroids, assign each point to the closest centroid) until the process terminates, meaning that none of the cluster assignments change. We will see more steps of this process on Wednesday.
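As a preview of that next step, the new centroids could be computed with a groupby and a mean; the following is just a sketch of one reasonable way to write it in pandas, not something we ran in class.

# Mean x and y within each of the three current clusters;
# the result is a small DataFrame with one row per cluster.
new_centroids = df_data.groupby("first cluster")[["x", "y"]].mean()
new_centroids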