# K-Means clustering 2

We've seen how to perform K-Means clustering using scikit-learn, but not how the algorithm actually works. The main goal today is to see what the K-Means clustering algorithm is doing.

## Warm-up

* Make a DataFrame `df` with two columns, "miles" and "cars", containing the following five data points, given as (miles, cars) pairs: (0,1), (0,5), (1,0), (1,1), (1,5).
* For later use, make a copy of `df` called `df2`.  Be sure to use `.copy()`.
* Using K-Means clustering, divide the data in `df` into two clusters. Store the cluster labels in a new column called "cluster".
* Does the result match what you expect?

In [1]:
import pandas as pd

In [2]:
df = pd.DataFrame([[0,1],[0,5],[1,0],[1,1],[1,5]], columns=["miles","cars"])
df

|   | miles | cars |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 5 |
| 2 | 1 | 0 |
| 3 | 1 | 1 |
| 4 | 1 | 5 |


In [3]:
# Another way
pd.DataFrame({"miles":[0,0,1,1,1],"cars":[1,5,0,1,5]})

|   | miles | cars |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 5 |
| 2 | 1 | 0 |
| 3 | 1 | 1 |
| 4 | 1 | 5 |


In [4]:
from sklearn.cluster import KMeans

In [5]:
# create/instantiate
kmeans = KMeans(n_clusters=2)

In [6]:
kmeans.fit(df)

KMeans(n_clusters=2)

In [7]:
kmeans.predict(df)

array([0, 1, 0, 0, 1], dtype=int32)

In [8]:
df["cluster"] = kmeans.predict(df)

Notice how the two data points with 5 cars are in one cluster, and the three data points with 0 or 1 cars are in the other cluster. (The cluster numbers themselves are arbitrary: which cluster gets the label 0 and which gets the label 1 can change from run to run.)

In [10]:
df

|   | miles | cars | cluster |
|---|---|---|---|
| 0 | 0 | 1 | 0 |
| 1 | 0 | 5 | 1 |
| 2 | 1 | 0 | 0 |
| 3 | 1 | 1 | 0 |
| 4 | 1 | 5 | 1 |
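
To see why the points are grouped this way, it can help to look at the fitted cluster centers. The following is a minimal sketch that rebuilds the warm-up data from scratch; the values `n_init=10` and `random_state=0` are choices made here for reproducibility, not something used in the cells above.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Rebuild the warm-up data so this block runs on its own
df = pd.DataFrame([[0, 1], [0, 5], [1, 0], [1, 1], [1, 5]], columns=["miles", "cars"])

# n_init and random_state are assumptions made for reproducibility
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(df)  # same as calling fit, then predict

# One row per cluster: the mean of the points assigned to that cluster
centers = pd.DataFrame(kmeans.cluster_centers_, columns=df.columns)
print(labels)
print(centers)
```

Each center is the average of the points in its cluster, so the "cars" coordinates of the two centers are far apart while the "miles" coordinates are close together. That is one way to see that "cars" is driving the split.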


## Importance of scaling

Let's get the sub-DataFrame with just the "miles" and "cars" columns (not the "cluster" column).  The following doesn't work: the column names need to be in a list.

In [11]:
df["miles","cars"]

KeyError: ('miles', 'cars')

In [12]:
df[["miles","cars"]]

|   | miles | cars |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 5 |
| 2 | 1 | 0 |
| 3 | 1 | 1 |
| 4 | 1 | 5 |


In [13]:
df2 = df[["miles","cars"]].copy()

In [14]:
df2

|   | miles | cars |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 5 |
| 2 | 1 | 0 |
| 3 | 1 | 1 |
| 4 | 1 | 5 |


In the call to `rename` below, the `axis=1` keyword argument says that the dictionary refers to column names (rather than row labels).

In [15]:
df2 = df2.rename({"miles":"feet"}, axis=1).copy()

In [16]:
# same as df2.feet = 5280*df2.feet
df2.feet *= 5280

In [17]:
df2

|   | feet | cars |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 5 |
| 2 | 5280 | 0 |
| 3 | 5280 | 1 |
| 4 | 5280 | 5 |


Notice that `df2` and `df` describe exactly the same five data points (ignoring the "cluster" column in `df`). The only difference is the unit of measurement used. `df` uses "miles" for its first column, whereas `df2` uses "feet".

In [18]:
kmeans2 = KMeans(n_clusters=2)

In [19]:
kmeans2.fit(df2)

KMeans(n_clusters=2)

In [20]:
df2["cluster"] = kmeans2.predict(df2)

Now the "feet" column dominates instead of the "cars" column: the clusters split on feet (0 versus 5280) and ignore the number of cars. This is bad, since the only change was to the unit of measurement used. The lesson is that, unless the columns are measured on comparable scales, we should rescale them before clustering, for example using scikit-learn's `StandardScaler`. (We won't do that normalization today.)

In [22]:
df2

|   | feet | cars | cluster |
|---|---|---|---|
| 0 | 0 | 1 | 0 |
| 1 | 0 | 5 | 0 |
| 2 | 5280 | 0 | 1 |
| 3 | 5280 | 1 | 1 |
| 4 | 5280 | 5 | 1 |
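
For reference, here is a rough sketch of what rescaling with `StandardScaler` could look like. It is not part of today's material. The names `df_miles` and `df_feet` are made up for this example, and `n_init=10` and `random_state=0` are arbitrary choices for reproducibility.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical names: the same five points, in miles and in feet
df_miles = pd.DataFrame([[0, 1], [0, 5], [1, 0], [1, 1], [1, 5]], columns=["miles", "cars"])
df_feet = df_miles.rename({"miles": "feet"}, axis=1)
df_feet["feet"] = 5280 * df_feet["feet"]

# StandardScaler rescales each column to mean 0 and standard deviation 1
scaled_miles = StandardScaler().fit_transform(df_miles)
scaled_feet = StandardScaler().fit_transform(df_feet)

# After scaling, the choice of unit no longer matters
same_after_scaling = np.allclose(scaled_miles, scaled_feet)
print(same_after_scaling)

# Cluster the scaled data instead of the raw data
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled_feet)
print(labels)
```

Because the two scaled arrays agree, K-Means sees the same input whether the distances were recorded in miles or in feet. Which clustering comes out of the scaled data is a separate question; the point here is only that it no longer depends on the unit.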


## Summary of K-Means

On the whiteboard we described:
* The K-Means algorithm.
* How a clustering is evaluated (why is one clustering better than another).
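
The whiteboard discussion is not reproduced in these notes. As a rough sketch, the code below assumes the standard formulation of K-Means (often called Lloyd's algorithm): alternate between assigning each point to its nearest center and moving each center to the mean of its assigned points. It also assumes a clustering is scored by the total squared distance from each point to its center, which is the quantity scikit-learn stores in the `inertia_` attribute (smaller is better). The function name `kmeans_sketch` and its parameters are made up for illustration, and this is not scikit-learn's implementation.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Hypothetical helper: a plain Lloyd iteration for illustration only."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialization (an assumption): start from k distinct data points
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: squared distance from every point to every center
        sq_dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its previous center)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # the centers stopped moving
        centers = new_centers
    # Final assignment and score for the centers we ended with
    sq_dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = sq_dists.argmin(axis=1)
    inertia = sq_dists.min(axis=1).sum()
    return labels, centers, inertia

# The five warm-up points, as (miles, cars)
points = [[0, 1], [0, 5], [1, 0], [1, 1], [1, 5]]
labels, centers, inertia = kmeans_sketch(points, k=2)
print(labels)
print(centers)
print(inertia)
```

Under this scoring, one clustering is better than another if its total squared distance is smaller. Lloyd's algorithm can stop at a clustering that is not the best possible one, which is one reason implementations often rerun it from several starting points.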

## Demonstrations

Here is a nice video demonstration of the K-Means clustering algorithm.

<iframe width="560" height="315" src="https://www.youtube.com/embed/5I3Ei69I40s" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

Say you want to divide the plane into different regions, depending on which point is closest.  What does that division of the plane look like?  Check your answer [here](https://en.wikipedia.org/wiki/Voronoi_diagram#/media/File:Voronoi_growth_euclidean.gif).
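
The same "which point is closest" rule can be tried numerically. The sketch below uses three made-up seed points and a 50-by-50 grid over the unit square (both are assumptions for illustration) and labels every grid point with the index of its nearest seed. This is the same computation as the assignment step of K-Means, with the cluster centers playing the role of the seeds.

```python
import numpy as np

# Hypothetical seed points in the unit square
seeds = np.array([[0.2, 0.3], [0.7, 0.8], [0.8, 0.2]])

# A grid of sample points covering the unit square
xs, ys = np.meshgrid(np.linspace(0, 1, 50), np.linspace(0, 1, 50))
grid = np.column_stack([xs.ravel(), ys.ravel()])

# Squared distance from every grid point to every seed
sq_dists = ((grid[:, None, :] - seeds[None, :, :]) ** 2).sum(axis=2)

# region[i, j] is the index of the seed closest to the grid point (xs[i, j], ys[i, j])
region = sq_dists.argmin(axis=1).reshape(xs.shape)
print(np.unique(region, return_counts=True))
```

Plotting `region` as an image (for example with Matplotlib's `imshow`) would show regions with straight-line boundaries, matching the picture in the link.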