K-Means clustering 3

The main point of this worksheet is to get more comfortable with the different steps involved in the K-Means algorithm. This worksheet is more about the mathematical algorithm, and less about how to run it using scikit-learn.

import pandas as pd
import altair as alt

Finding the closest point

df_centers = pd.DataFrame([[-4,10],[0.2,0.8],[2,-3],[5,-1],[-4,5],[-2,-1],[-7,-1.45]],columns=["x","y"])
len(df_centers)
7
  • Which of the 7 points in df_centers is closest to (-3,3)? Which is furthest from \((-3,3)\)? Look at the Altair chart below to answer these questions.

alt.Chart(df_centers).mark_circle(size=100).encode(
    x="x",
    y="y"
)
  • Create a length 2 pandas Series s from the dictionary {"x":-3, "y":3}. We will think of this pandas Series as corresponding to the point \((-3,3)\). (Use pd.Series.)

  • Create a length 2 pandas Series from the 0-th row of df_centers. Name the result t. (You shouldn’t have to type pd.Series in this problem, nor should you have to type a colon : symbol. Use iloc.)

  • The compute the square of the distance between s and t. (We eventually want to find the point whose distance is closest, but it’s the exact same to find the point whose squared distance is smallest. This saves us from having to compute a square root.)

Be sure to solve this problem without typing any numbers. You should only have to type variables like s and labels like "x". The correct answer is 50.

Compute the squared distance between s and every point in df_centers. Don’t use any for loops. For example, you can use code like s["x"] - df_centers["x"]. The result should be a pandas Series like the following:

0    50.0000
1    15.0800
2    61.0000
3    80.0000
4     5.0000
5    17.0000
6    35.8025
dtype: float64
  • What is the integer location of the closest point to s? Of the furthest point? Use the pandas Series methods argmin and argmax (documentation).

  • Look at the points in df_centers at these locations (use iloc). Does it match what you expected from the Altair chart above?

A function to find the closest point

  • Write a function closest_arg which again takes two inputs, a Series like s and a DataFrame like df_centers, and which as output returns the integer location of the row containing the closest point in df_centers to s.

For example, if you evaluate closest_arg(s, df_centers) with s and df_centers defined as above, then the output should be 4.

K-Means clustering steps

Here we’re going to switch to a smaller df_centers DataFrame, but instead of a single point s, we are going to look at many points.

We start out by defining df_centers.

  • Execute the following cell. You don’t need to make any changes.

df_centers = pd.DataFrame([[0,0], [5,0], [0,5]], columns=["x","y"])

Now we’re going to generate a lot of data points using a scikit-learn function make_blobs. This is similar to the make_regression function from one of this week’s videos, but instead of generating data on a line, it generates data in clusters.

The make_blobs function accepts many different keyword arguments. Here are the ones we are using. Feel free to change these to see how the result changes.

  • n_samples: the number of points to generate.

  • n_features: how many columns for the resulting NumPy array.

  • random_state: to make sure we always get the same results.

  • cluster_std: how far spread out each individual cluster will be. (The smaller this number, the more distinct the clusters will look.)

  • centers: how many clusters we want.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
X, _ = make_blobs(n_samples=100, n_features=2, random_state=5, cluster_std=1.6, centers=4)
  • What type of object is X?

  • What type of object is _?

  • What is the shape of X? Does it make sense?

  • Convert X into a pandas DataFrame and save it with the variable name df_pts. Use the column names “x” and “y”.

  • Define two Altair charts using the following code. In the c_centers chart, we are putting some keyword arguments into mark_point() (so instead of mark_point() we are using mark_point(shape="diamond", size=100, color="black", fill="black")). This will make the centers stand out more.

c_pts = alt.Chart(df_pts).mark_circle().encode(
    x="x",
    y="y"
)

c_centers = alt.Chart(df_centers).mark_point(
    shape="diamond",
    size=100,
    color="black",
    fill="black"
    ).encode(
    x="x",
    y="y"
)
  • Display both of these charts on the same axes. Instead of using c_pts|c_centers (for side-by-side) or c_pts&c_centers (for one above the other), use c_pts+c_centers.

Think of these three diamonds as the random starting points for K-Means clustering. (We secretly know that there are four clusters, which means it would be better to use 4 center points, but in real situations we won’t know the true number of clusters, and we can use any number of center points we want.)

Find which center is closest

For each of the 100 points in df_pts (each of the 100 blue circles from the Altair chart), we want to find which of the center points is closest.

For example, think of df_pts.iloc[30] as like the s Series from above.

df_pts.iloc[30]
  • If we want to know which of the black diamonds is closest to df_pts.iloc[30], we can use the closest_arg function above.

If all your numbers are the same as mine, you should find that it’s closest to point 2 and that the point at iloc[29] is closest to point 1. The only possible answers you should see are 0,1,2, because there are only 3 center points.

Using apply

  • Now apply that same function to each row in df_pts using apply and a lambda function. Use code of the form df_pts.apply(lambda s: ???, axis=???). Think of the variable s in this lambda function as like df_pts.iloc[30] from above.

The result should be a length 100 pandas Series.

  • Put this resulting pandas Series into a new column in df_pts called “closest”.

  • Update the c_pts chart from above so that the color is determined by the “closest” column. You shouldn’t have to copy-paste anything. Just use c_pts = c_pts.encode(color=???). Do you see how the points are now colored according to which of the black diamonds is closest?

  • Specify a Nominal encoding data type.

  • Redisplay c_pts+c_centers.

Finding the centroids

Using pandas groupby, it is surprisingly easy to find the centroid of each individual cluster. The intuition is that the coordinates of the centroid are given by the average value of the points in that group.

  • Evaluate df_pts.groupby("closest").???(). What method do we want to use to find the averages for each individual cluster?

  • Save the result with the name df_new_centers.

  • Make a copy of c_centers named c_new_centers.

  • Change the c_new_centers.data attribute to use the new DataFrame.

  • Display c_pts and c_new_centers on the same axes. Does it look like the black diamonds are now in the center of the clusters?

  • Do you see some points that will change color if we repeat the coloring step (because they’re now closer to a different center point)?

Summary

We have begun the K-Means clustering algorithm. The next step would be to re-color the points, according to which of the new center points is closest. Then we would compute the new centroids, and keep repeating this procedure until there is an iteration in which nothing changes.

Created in deepnote.com Created in Deepnote