K-Means clustering 2

We’ve seen how to implement K-Means clustering using scikit-learn, but not how the algorithm actually works. The main goal today is to see what the K-Means clustering algorithm is doing.

Warm-up

  • Make a DataFrame df with two columns, “miles” and “cars”, containing the following five data points in the form (miles, cars): (0,1), (0,5), (1,0), (1,1), (1,5).

  • For later use, make a copy of df called df2. Be sure to use .copy().

  • Using K-Means clustering, divide the data in df into two clusters. Store the cluster labels in a new column called “cluster”.

  • Does the result match what you expect?

import pandas as pd
df = pd.DataFrame([[0,1],[0,5],[1,0],[1,1],[1,5]], columns=["miles","cars"])
df
miles cars
0 0 1
1 0 5
2 1 0
3 1 1
4 1 5
# Another way
pd.DataFrame({"miles":[0,0,1,1,1],"cars":[1,5,0,1,5]})
miles cars
0 0 1
1 0 5
2 1 0
3 1 1
4 1 5
from sklearn.cluster import KMeans
# create/instantiate
kmeans = KMeans(n_clusters=2)
kmeans.fit(df)
KMeans(n_clusters=2)
kmeans.predict(df)
array([0, 1, 0, 0, 1], dtype=int32)
df["cluster"] = kmeans.predict(df)

Notice how the two data points with 5 cars are in one cluster, and the three data points with 0 or 1 cars are in the other cluster. (The cluster numberings themselves are random.)

df
miles cars cluster
0 0 1 0
1 0 5 1
2 1 0 0
3 1 1 0
4 1 5 1
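To see where each cluster sits, and to check the grouping without depending on which cluster happens to be called 0 or 1, a sketch along the following lines may help. The names df_demo, kmeans_demo and same_as_row0 are made up for this illustration, and random_state and n_init are set only so that the sketch is reproducible; they are not part of the warm-up.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Same five data points as in the warm-up (df_demo is a hypothetical name).
df_demo = pd.DataFrame({"miles": [0, 0, 1, 1, 1], "cars": [1, 5, 0, 1, 5]})

# random_state and n_init are assumptions, chosen only for reproducibility.
kmeans_demo = KMeans(n_clusters=2, n_init=10, random_state=0).fit(df_demo)

labels = kmeans_demo.labels_            # cluster label assigned to each row during fit
centers = kmeans_demo.cluster_centers_  # one row per cluster: (mean miles, mean cars)

print(labels)
print(np.round(centers, 3))

# The label values 0 and 1 can swap between runs, so compare groupings instead:
# which rows share a cluster with row 0?
same_as_row0 = labels == labels[0]
print(same_as_row0)
```

Each row of cluster_centers_ is the average of the points in that cluster, so the cluster containing the two 5-car points has its center at 0.5 miles and 5 cars.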

Importance of scaling

Let’s get the sub-DataFrame with just the “miles” and “cars” columns (not the “cluster” column). The following doesn’t work: the column names need to be in a list.

df["miles","cars"]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: ('miles', 'cars')

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_39939/3269067737.py in <module>
----> 1 df["miles","cars"]

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   3456             if self.columns.nlevels > 1:
   3457                 return self._getitem_multilevel(key)
-> 3458             indexer = self.columns.get_loc(key)
   3459             if is_integer(indexer):
   3460                 indexer = [indexer]

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364 
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: ('miles', 'cars')
df[["miles","cars"]]
miles cars
0 0 1
1 0 5
2 1 0
3 1 1
4 1 5
df2 = df[["miles","cars"]].copy()
df2
miles cars
0 0 1
1 0 5
2 1 0
3 1 1
4 1 5

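Next we rename the “miles” column to “feet” and convert its values from miles to feet. The axis=1 keyword argument tells rename that we are changing the names of columns (rather than index labels).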

df2 = df2.rename({"miles":"feet"}, axis=1).copy()
# same as df2.feet = 5280*df2.feet
df2.feet *= 5280
df2
feet cars
0 0 1
1 0 5
2 5280 0
3 5280 1
4 5280 5

Notice that df2 contains exactly the same data as the “miles” and “cars” columns of df. The only difference is the unit of measurement: df records its first column in miles, whereas df2 records it in feet.

kmeans2 = KMeans(n_clusters=2)
kmeans2.fit(df2)
KMeans(n_clusters=2)
df2["cluster"] = kmeans2.predict(df2)

Now the “feet” column dominates instead of the “cars” column: a difference of 5280 feet dwarfs a difference of at most 5 cars, so the clusters are split by distance rather than by number of cars. This is bad, since the only change was to the unit of measurement. The lesson is that, unless the columns are measured in the same unit, we should normalize the data first, for example with StandardScaler. (We won’t do that normalization today.)

df2
feet cars cluster
0 0 1 0
1 0 5 0
2 5280 0 1
3 5280 1 1
4 5280 5 1
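Although we won’t normalize today, here is a minimal sketch of what it could look like, under the assumption that we standardize each column with scikit-learn’s StandardScaler before clustering. The names df_miles, df_feet, scaled_miles and scaled_feet are made up for this illustration, and random_state and n_init are set only for reproducibility.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# The same data in two units (hypothetical names, not the df and df2 above).
df_miles = pd.DataFrame({"miles": [0, 0, 1, 1, 1], "cars": [1, 5, 0, 1, 5]})
df_feet = df_miles.rename({"miles": "feet"}, axis=1)
df_feet["feet"] = 5280 * df_feet["feet"]

# StandardScaler rescales each column to have mean 0 and standard deviation 1.
scaled_miles = StandardScaler().fit_transform(df_miles)
scaled_feet = StandardScaler().fit_transform(df_feet)

# After scaling, the choice of unit no longer matters.
units_agree = np.allclose(scaled_miles, scaled_feet)
print(units_agree)

# Cluster the scaled data; both versions receive (numerically) the same input.
labels_miles = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled_miles)
labels_feet = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled_feet)
print(labels_miles)
print(labels_feet)
```

The point of the sketch is only that the scaled data is the same whichever unit we start from. The clusters found on the scaled data may differ from both of the unscaled clusterings above.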

Summary of K-Means

On the whiteboard we described:

  • The K-Means algorithm.

  • How a clustering is evaluated (why one clustering is better than another).
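For reference, here is a rough NumPy sketch of one common description of K-Means (often called Lloyd’s algorithm): alternate between assigning each point to its nearest center and moving each center to the mean of its assigned points. The whiteboard version may differ in details. The function names assign, inertia and lloyd_kmeans are made up for this sketch, as is the choice of the first two data points as starting centers. The evaluation used here, the sum of squared distances from each point to its cluster mean, is an assumption on our part; it matches what scikit-learn reports as inertia_.

```python
import numpy as np

def assign(X, centers):
    """Label each point with the index of its nearest center."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def inertia(X, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    total = 0.0
    for j in np.unique(labels):
        members = X[labels == j]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total

def lloyd_kmeans(X, initial_centers, max_iter=100):
    """Alternate assignment and update steps until the centers stop moving."""
    centers = np.asarray(initial_centers, dtype=float).copy()
    for _ in range(max_iter):
        labels = assign(X, centers)
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign(X, centers), centers

# The warm-up data, as (miles, cars).
X = np.array([[0, 1], [0, 5], [1, 0], [1, 1], [1, 5]], dtype=float)

# Starting from the first two points as centers (an arbitrary choice).
labels, centers = lloyd_kmeans(X, initial_centers=X[:2])
print(labels)   # [0 1 0 0 1]
print(centers)  # approximately [[0.667 0.667], [0.5 5.0]]

# Compare the clustering found above with the split by miles.
inertia_cars = inertia(X, labels)                      # about 1.83
inertia_miles = inertia(X, np.array([0, 0, 1, 1, 1]))  # 22.0
print(inertia_cars, inertia_miles)
```

Under this measure, the grouping by number of cars (about 1.83) is much better than the grouping by miles (22), which is why K-Means chooses it for the unscaled miles data.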

Demonstrations

Here is a nice video demonstration of the K-Means clustering algorithm.

Say you want to divide the plane into different regions, depending on which point is closest. What does that division of the plane look like? Check your answer here.