K-Means clustering 2
We’ve seen how to implement K-Means clustering using scikit-learn, but not how the algorithm actually works. The main goal today is to see what the K-Means clustering algorithm is doing.
Warm-up

- Make a DataFrame `df` with two columns, "miles" and "cars", containing the following five data points in (miles, cars): (0,1), (0,5), (1,0), (1,1), (1,5).
- For later use, make a copy of `df` called `df2`. Be sure to use `.copy()`.
- Using K-Means clustering, divide the `df` data into two clusters. Store the cluster labels in a new column called "cluster".
- Does the result match what you expect?
import pandas as pd
df = pd.DataFrame([[0,1],[0,5],[1,0],[1,1],[1,5]], columns=["miles","cars"])
df
|   | miles | cars |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 5 |
| 2 | 1 | 0 |
| 3 | 1 | 1 |
| 4 | 1 | 5 |
# Another way
pd.DataFrame({"miles":[0,0,1,1,1],"cars":[1,5,0,1,5]})
|   | miles | cars |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 5 |
| 2 | 1 | 0 |
| 3 | 1 | 1 |
| 4 | 1 | 5 |
from sklearn.cluster import KMeans
# create/instantiate
kmeans = KMeans(n_clusters=2)
kmeans.fit(df)
KMeans(n_clusters=2)
kmeans.predict(df)
array([0, 1, 0, 0, 1], dtype=int32)
df["cluster"] = kmeans.predict(df)
Notice how the two data points with 5 cars are in one cluster, and the three data points with 0 or 1 cars are in the other cluster. (The cluster numbers themselves are arbitrary: which group gets labeled 0 and which gets labeled 1 can change from run to run.)
df
|   | miles | cars | cluster |
|---|---|---|---|
| 0 | 0 | 1 | 0 |
| 1 | 0 | 5 | 1 |
| 2 | 1 | 0 | 0 |
| 3 | 1 | 1 | 0 |
| 4 | 1 | 5 | 1 |
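If you want to double-check this, one option (not run in class, so treat it as a sketch using the objects defined above) is to look at the fitted centroids, or at the average of each column within each cluster; the "cars" values differ far more between the two clusters than the "miles" values do.

# The fitted centroids: one row per cluster, columns in the same
# order as the columns passed to fit ("miles", then "cars").
kmeans.cluster_centers_

# Average of each original column within each cluster.
df.groupby("cluster")[["miles", "cars"]].mean()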
Importance of scaling
Let’s get the sub-DataFrame with just the “miles” and “cars” columns (not the “cluster” column). The following doesn’t work: the column names need to be in a list.
df["miles","cars"]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3360 try:
-> 3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: ('miles', 'cars')
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_39939/3269067737.py in <module>
----> 1 df["miles","cars"]
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
3456 if self.columns.nlevels > 1:
3457 return self._getitem_multilevel(key)
-> 3458 indexer = self.columns.get_loc(key)
3459 if is_integer(indexer):
3460 indexer = [indexer]
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
-> 3363 raise KeyError(key) from err
3364
3365 if is_scalar(key) and isna(key) and not self.hasnans:
KeyError: ('miles', 'cars')
df[["miles","cars"]]
|   | miles | cars |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 5 |
| 2 | 1 | 0 |
| 3 | 1 | 1 |
| 4 | 1 | 5 |
df2 = df[["miles","cars"]].copy()
df2
|   | miles | cars |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 5 |
| 2 | 1 | 0 |
| 3 | 1 | 1 |
| 4 | 1 | 5 |
The `axis=1` keyword argument says that we are renaming columns (rather than rows).
df2 = df2.rename({"miles":"feet"}, axis=1).copy()
# same as df2.feet = 5280*df2.feet
df2.feet *= 5280
df2
|   | feet | cars |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 5 |
| 2 | 5280 | 0 |
| 3 | 5280 | 1 |
| 4 | 5280 | 5 |
Notice that `df2` and the "miles" and "cars" columns of `df` contain the exact same data. The only difference is the unit of measurement: `df` uses "miles" for its first column, whereas `df2` uses "feet".
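As a quick sanity check (not part of the original notebook), we can convert the "feet" column back to miles and confirm it matches the original column exactly:

# Should evaluate to True: df2 is just df measured in different units.
(df2["feet"] / 5280 == df["miles"]).all()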
kmeans2 = KMeans(n_clusters=2)
kmeans2.fit(df2)
KMeans(n_clusters=2)
df2["cluster"] = kmeans2.predict(df2)
Now it is the "feet" column that dominates the clustering, instead of the "cars" column. This is bad, since the only change was the unit of measurement. The lesson is that, unless the columns are measured in comparable units, we should normalize them, for example using `StandardScaler`. (We won't do that normalization today.)
df2
|   | feet | cars | cluster |
|---|---|---|---|
| 0 | 0 | 1 | 0 |
| 1 | 0 | 5 | 0 |
| 2 | 5280 | 0 | 1 |
| 3 | 5280 | 1 | 1 |
| 4 | 5280 | 5 | 1 |
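As a preview of that normalization step (we did not run this in class, and the names `scaler` and `kmeans3` below are just placeholders), here is a minimal sketch of using `StandardScaler` before clustering, so that the choice of units no longer matters:

from sklearn.preprocessing import StandardScaler

# Rescale each column to mean 0 and standard deviation 1 before clustering,
# so that "feet" and "cars" contribute on comparable scales.
scaler = StandardScaler()
scaled = scaler.fit_transform(df2[["feet", "cars"]])

kmeans3 = KMeans(n_clusters=2)
kmeans3.fit(scaled)
kmeans3.predict(scaled)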
Summary of K-Means

On the whiteboard we described:

- The K-Means algorithm.
- How a clustering is evaluated (why one clustering is better than another).
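Since the algorithm itself only appeared on the whiteboard, here is a rough NumPy sketch of those two ideas (the function `kmeans_sketch` and its details are mine, not scikit-learn's): repeatedly assign each point to its nearest centroid, then move each centroid to the mean of its points; and measure a clustering by its within-cluster sum of squared distances, which scikit-learn calls `inertia_` (smaller is better). It assumes `df` from the cells above.

import numpy as np

def kmeans_sketch(X, k, n_iter=10, rng_seed=0):
    """A small illustration of the K-Means algorithm; empty clusters are not handled."""
    rng = np.random.default_rng(rng_seed)
    X = np.asarray(X, dtype=float)
    # Step 0: pick k of the data points at random as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each point with the index of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of the points assigned to it.
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    # Final assignment, then the within-cluster sum of squared distances
    # (smaller means the points sit closer to their centroids).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    inertia = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, inertia

labels, centroids, inertia = kmeans_sketch(df[["miles", "cars"]], k=2)
labels, inertia

On this toy data the sketch will typically recover the same split as before (the two rows with 5 cars versus the other three), up to relabeling. In general, though, a single random start can land in a worse local optimum, which is why scikit-learn's KMeans reruns the algorithm from several starts (its n_init parameter) and keeps the result with the smallest inertia.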