K-Means clustering 3
Contents
K-Means clustering 3¶
The main point of this worksheet is to get more comfortable with the different steps involved in the K-Means algorithm. This worksheet is more about the mathematical algorithm, and less about how to run it using scikit-learn.
import pandas as pd
import altair as alt
Finding the closest point¶
df_centers = pd.DataFrame([[-4,10],[0.2,0.8],[2,-3],[5,-1],[-4,5],[-2,-1],[-7,-1.45]],columns=["x","y"])
len(df_centers)
7
Which of the 7 points in
df_centers
is closest to (-3,3)? Which is furthest from \((-3,3)\)? Look at the Altair chart below to answer these questions.
alt.Chart(df_centers).mark_circle(size=100).encode(
x="x",
y="y"
)
Create a length 2 pandas Series
s
from the dictionary{"x":-3, "y":3}
. We will think of this pandas Series as corresponding to the point \((-3,3)\). (Usepd.Series
.)
Create a length 2 pandas Series from the 0-th row of
df_centers
. Name the resultt
. (You shouldn’t have to typepd.Series
in this problem, nor should you have to type a colon:
symbol. Useiloc
.)
The compute the square of the distance between
s
andt
. (We eventually want to find the point whose distance is closest, but it’s the exact same to find the point whose squared distance is smallest. This saves us from having to compute a square root.)
Be sure to solve this problem without typing any numbers. You should only have to type variables like s
and labels like "x"
. The correct answer is 50
.
Compute the squared distance between s
and every point in df_centers
. Don’t use any for loops. For example, you can use code like s["x"] - df_centers["x"]
. The result should be a pandas Series like the following:
0 50.0000
1 15.0800
2 61.0000
3 80.0000
4 5.0000
5 17.0000
6 35.8025
dtype: float64
What is the integer location of the closest point to
s
? Of the furthest point? Use the pandas Series methodsargmin
andargmax
(documentation).
Look at the points in
df_centers
at these locations (useiloc
). Does it match what you expected from the Altair chart above?
A function to find the closest point¶
Write a function
closest_arg
which again takes two inputs, a Series likes
and a DataFrame likedf_centers
, and which as output returns the integer location of the row containing the closest point indf_centers
tos
.
For example, if you evaluate closest_arg(s, df_centers)
with s
and df_centers
defined as above, then the output should be 4
.
K-Means clustering steps¶
Here we’re going to switch to a smaller df_centers
DataFrame, but instead of a single point s
, we are going to look at many points.
We start out by defining df_centers
.
Execute the following cell. You don’t need to make any changes.
df_centers = pd.DataFrame([[0,0], [5,0], [0,5]], columns=["x","y"])
Now we’re going to generate a lot of data points using a scikit-learn function make_blobs
. This is similar to the make_regression
function from one of this week’s videos, but instead of generating data on a line, it generates data in clusters.
The make_blobs
function accepts many different keyword arguments. Here are the ones we are using. Feel free to change these to see how the result changes.
n_samples
: the number of points to generate.n_features
: how many columns for the resulting NumPy array.random_state
: to make sure we always get the same results.cluster_std
: how far spread out each individual cluster will be. (The smaller this number, the more distinct the clusters will look.)centers
: how many clusters we want.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
X, _ = make_blobs(n_samples=100, n_features=2, random_state=5, cluster_std=1.6, centers=4)
What type of object is
X
?What type of object is
_
?What is the shape of
X
? Does it make sense?
Convert
X
into a pandas DataFrame and save it with the variable namedf_pts
. Use the column names “x” and “y”.
Define two Altair charts using the following code. In the
c_centers
chart, we are putting some keyword arguments intomark_point()
(so instead ofmark_point()
we are usingmark_point(shape="diamond", size=100, color="black", fill="black")
). This will make the centers stand out more.
c_pts = alt.Chart(df_pts).mark_circle().encode(
x="x",
y="y"
)
c_centers = alt.Chart(df_centers).mark_point(
shape="diamond",
size=100,
color="black",
fill="black"
).encode(
x="x",
y="y"
)
Display both of these charts on the same axes. Instead of using
c_pts|c_centers
(for side-by-side) orc_pts&c_centers
(for one above the other), usec_pts+c_centers
.
Think of these three diamonds as the random starting points for K-Means clustering. (We secretly know that there are four clusters, which means it would be better to use 4 center points, but in real situations we won’t know the true number of clusters, and we can use any number of center points we want.)
Find which center is closest¶
For each of the 100 points in df_pts
(each of the 100 blue circles from the Altair chart), we want to find which of the center points is closest.
For example, think of df_pts.iloc[30]
as like the s
Series from above.
df_pts.iloc[30]
If we want to know which of the black diamonds is closest to
df_pts.iloc[30]
, we can use theclosest_arg
function above.
If all your numbers are the same as mine, you should find that it’s closest to point 2 and that the point at iloc[29]
is closest to point 1. The only possible answers you should see are 0,1,2, because there are only 3 center points.
Using apply
¶
Now apply that same function to each row in
df_pts
usingapply
and a lambda function. Use code of the formdf_pts.apply(lambda s: ???, axis=???)
. Think of the variables
in this lambda function as likedf_pts.iloc[30]
from above.
The result should be a length 100 pandas Series.
Put this resulting pandas Series into a new column in
df_pts
called “closest”.
Update the
c_pts
chart from above so that the color is determined by the “closest” column. You shouldn’t have to copy-paste anything. Just usec_pts = c_pts.encode(color=???)
. Do you see how the points are now colored according to which of the black diamonds is closest?Specify a Nominal encoding data type.
Redisplay
c_pts+c_centers
.
Finding the centroids¶
Using pandas groupby
, it is surprisingly easy to find the centroid of each individual cluster. The intuition is that the coordinates of the centroid are given by the average value of the points in that group.
Evaluate
df_pts.groupby("closest").???()
. What method do we want to use to find the averages for each individual cluster?Save the result with the name
df_new_centers
.
Make a copy of
c_centers
namedc_new_centers
.Change the
c_new_centers.data
attribute to use the new DataFrame.
Display
c_pts
andc_new_centers
on the same axes. Does it look like the black diamonds are now in the center of the clusters?Do you see some points that will change color if we repeat the coloring step (because they’re now closer to a different center point)?
Summary¶
We have begun the K-Means clustering algorithm. The next step would be to re-color the points, according to which of the new center points is closest. Then we would compute the new centroids, and keep repeating this procedure until there is an iteration in which nothing changes.
Created in Deepnote