# K-Means clustering

## Announcements

* No in-class quiz this week.  No homework due this week.
* Videos and video quizzes are posted.  Due Thursday.
* Homework for next Tuesday (Week 6) is posted.  You will have time to work on the homework during discussion section tomorrow.
* Midterms returned today.  The median score was high, 34.5/40 (good job!), so there won't be any curving.
* Midterm solutions are in the "Course Updates" section on the Canvas homepage.
* If you see a clear mistake in my grading (e.g., I added the points wrong), you must leave the exam with me today, together with a note explaining what the mistake is.

## Brief overview of Machine Learning

Machine Learning is roughly divided into two big categories:
* Supervised Learning
* Unsupervised Learning

In *supervised learning*, there is some "correct" value that you are aiming to calculate; in other words, at least some of the data includes correct labels.  In *unsupervised learning*, the key difference is that the data does not include labels.

Supervised learning is itself divided into two big categories:
* Regression, in which we are seeking to compute some numerical value.
* Classification, in which we are seeking to assign discrete labels.

Examples of **regression** problems in supervised learning:
* Predict the price of a house.
* Predict the temperature at a certain time.
* Predict the number of clicks on an ad.

Examples of **classification** problems in supervised learning:
* Identify a species of penguin from some measurements.
* Determine if an email is spam or not.
* Identify someone from a photograph.

The most famous example of unsupervised learning is **clustering**.

Here is a flowchart image I downloaded from the website [GeeksforGeeks](https://www.geeksforgeeks.org/flowchart-for-basic-machine-learning-models/).  Don't take it too literally, but it nicely illustrates some different categories of machine learning.

![Flowchart](../images/flowchart.jpeg)

## Review of StandardScaler

In [None]:
import seaborn as sns

When we import the Spotify dataset, we always specify `na_values=" "`.  We don't have to do that when importing a dataset from Seaborn.
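As a reminder of what `na_values=" "` does, here is a minimal sketch using a small made-up CSV string rather than the actual Spotify file.  The column names and values (`name`, `energy`, `song_a`, ...) are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV text in which a missing entry is recorded as a single space.
csv_text = "name,energy\nsong_a,0.5\nsong_b, \nsong_c,0.9\n"

# Without na_values, the " " entry is kept as text, so the column is not numeric.
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values=" ", the same entry is read as a missing value (NaN).
clean = pd.read_csv(io.StringIO(csv_text), na_values=" ")

print(raw["energy"].dtype)
print(clean["energy"].dtype)
print(clean["energy"].isna().sum())
```

The Seaborn datasets already store missing values as `NaN`, which is why no such argument is needed for them.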

In [None]:
df = sns.load_dataset("iris")

In [None]:
df.isna()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,False,False,False,False,False
1,False,False,False,False,False
2,False,False,False,False,False
3,False,False,False,False,False
4,False,False,False,False,False
...,...,...,...,...,...
145,False,False,False,False,False
146,False,False,False,False,False
147,False,False,False,False,False
148,False,False,False,False,False


In [None]:
df.isna().any(axis=0)

sepal_length    False
sepal_width     False
petal_length    False
petal_width     False
species         False
dtype: bool

Notice that the default value of `axis` is `0`, so omitting `axis=0` gives the same result.
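To see the difference between the two axes, here is a small sketch on a made-up Boolean DataFrame (the name `flags` and its contents are hypothetical, chosen only for illustration).

```python
import pandas as pd

# A tiny made-up Boolean DataFrame.
flags = pd.DataFrame({"a": [False, True, False], "b": [False, False, False]})

# axis=0 (the default): one result per column.
per_column = flags.any(axis=0)

# axis=1: one result per row.
per_row = flags.any(axis=1)

print(per_column.equals(flags.any()))  # same as omitting axis
print(per_column.tolist())
print(per_row.tolist())
```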

In [None]:
help(df.any)

Help on method any in module pandas.core.generic:

any(axis=0, bool_only=None, skipna=True, level=None, **kwargs) method of pandas.core.frame.DataFrame instance
    Return whether any element is True, potentially over an axis.
    
    Returns False unless there is at least one element within a series or
    along a Dataframe axis that is True or equivalent (e.g. non-zero or
    non-empty).
    
    Parameters
    ----------
    axis : {0 or 'index', 1 or 'columns', None}, default 0
        Indicate which axis or axes should be reduced.
    
        * 0 / 'index' : reduce the index, return a Series whose index is the
          original column labels.
        * 1 / 'columns' : reduce the columns, return a Series whose index is the
          original index.
        * None : reduce all axes, return a scalar.
    
    bool_only : bool, default None
        Include only boolean columns. If None, will attempt to use everything,
        then use only boolean data. Not implemented for Series.


In [None]:
df.isna().any()

sepal_length    False
sepal_width     False
petal_length    False
petal_width     False
species         False
dtype: bool

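Here is a quick way to check whether there are any missing values anywhere in the DataFrame.  The first `any` gives one Boolean per column, and the second `any` reduces those to a single Boolean.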

In [None]:
df.isna().any().any()

False

In [None]:
2+2

4

In [None]:
penguin = sns.load_dataset("penguins")

In [None]:
penguin.isna().any(axis=1)

0      False
1      False
2      False
3       True
4      False
       ...  
339     True
340    False
341    False
342    False
343    False
Length: 344, dtype: bool

In [None]:
penguin.loc[3]

species                 Adelie
island               Torgersen
bill_length_mm             NaN
bill_depth_mm              NaN
flipper_length_mm          NaN
body_mass_g                NaN
sex                        NaN
Name: 3, dtype: object

In [None]:
penguin.loc[3,"bill_depth_mm"]

nan

In [None]:
type(penguin.loc[3,"bill_depth_mm"])

numpy.float64

In [None]:
import numpy as np

In [None]:
np.isnan(penguin.loc[3,"bill_depth_mm"])

True

In [None]:
penguin.loc[3,"bill_depth_mm"] == np.nan

False

Potential for confusion: `np.nan` is not equal to itself, so `==` cannot be used to test for a missing value.  Use `np.isnan` instead, as above.

In [None]:
np.nan == np.nan

False
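Here is a short sketch collecting the reliable ways to test for a missing value.  The function `pd.isna` was not used in the cells above; it is included because it also handles non-numeric entries.

```python
import numpy as np
import pandas as pd

x = np.nan

# Equality comparison with NaN is always False, even against itself.
print(x == x)

# Reliable checks for a missing value.
print(np.isnan(x))
print(pd.isna(x))

# pd.isna also accepts non-numeric entries such as None,
# where np.isnan would raise a TypeError.
print(pd.isna(None))
```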

In [None]:
df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


Here is the general routine for working with scikit-learn.
* Import
* Instantiate (create an object)
* `fit`
* `transform` or `predict`
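The four steps above can be sketched on a tiny made-up array before we apply them to the iris data.  The names `X` and `scaler_demo` are hypothetical and are not used elsewhere in these notes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler  # import

# Small made-up array: 4 rows, 2 numeric columns.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])

scaler_demo = StandardScaler()        # instantiate
scaler_demo.fit(X)                    # fit: learn each column's mean and standard deviation
X_scaled = scaler_demo.transform(X)   # transform: rescale using the learned values

print(scaler_demo.mean_)
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

After the transform, each column has mean 0 and standard deviation 1 (up to floating-point rounding).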

In [None]:
# import
from sklearn.preprocessing import StandardScaler

In [None]:
# instantiate
scaler = StandardScaler()

In [None]:
type(scaler)

sklearn.preprocessing._data.StandardScaler

The following raises an error because `df` has a non-numeric column called "species".

In [None]:
scaler.fit(df)

ValueError: could not convert string to float: 'setosa'

In [None]:
df.columns

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')

In [None]:
numcols = [c for c in df.columns if c != "species"]

In [None]:
[c for c in df.columns if not (c == "species")]

['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

In [None]:
numcols

['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

In [None]:
# fit
scaler.fit(df[numcols])

StandardScaler()

In [None]:
# transform
df[numcols] = scaler.transform(df[numcols])

In [None]:
df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,-0.900681,1.019004,-1.340227,-1.315444,setosa
1,-1.143017,-0.131979,-1.340227,-1.315444,setosa
2,-1.385353,0.328414,-1.397064,-1.315444,setosa
3,-1.506521,0.098217,-1.283389,-1.315444,setosa
4,-1.021849,1.249201,-1.340227,-1.315444,setosa
...,...,...,...,...,...
145,1.038005,-0.131979,0.819596,1.448832,virginica
146,0.553333,-1.282963,0.705921,0.922303,virginica
147,0.795669,-0.131979,0.819596,1.053935,virginica
148,0.432165,0.788808,0.933271,1.448832,virginica


In [None]:
df[numcols].mean(axis=0)

sepal_length   -4.736952e-16
sepal_width    -7.815970e-16
petal_length   -4.263256e-16
petal_width    -4.736952e-16
dtype: float64

In [None]:
df[numcols].std(axis=0)

sepal_length    1.00335
sepal_width     1.00335
petal_length    1.00335
petal_width     1.00335
dtype: float64
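The standard deviations are 1.00335 rather than exactly 1 because of a difference in conventions: `StandardScaler` divides by the population standard deviation (denominator n), while the pandas `std` method defaults to the sample standard deviation (`ddof=1`, denominator n-1).  With n = 150 rows, the ratio is the square root of 150/149, which is approximately 1.00335.  Here is a minimal sketch on made-up data (the DataFrame `demo` is hypothetical) showing the same effect.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up data with 150 rows, the same number of rows as the iris dataset.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"x": rng.normal(size=150)})

# fit_transform returns a 2D array; take its only column.
demo["x_scaled"] = StandardScaler().fit_transform(demo[["x"]])[:, 0]

print(demo["x_scaled"].std())        # pandas default, ddof=1
print(demo["x_scaled"].std(ddof=0))  # population standard deviation
print(np.sqrt(150 / 149))
```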

## K-Means clustering

The library scikit-learn uses a very consistent syntax, so what we do with `KMeans` should look very similar to what we did with `StandardScaler`.

In [None]:
# import
from sklearn.cluster import KMeans

In [None]:
# instantiate/create
kmeans = KMeans()

In [None]:
type(kmeans)

sklearn.cluster._kmeans.KMeans

In [None]:
# fit
kmeans.fit(df[numcols])

KMeans()

Here we put the predicted clusters into a new column called "cluster".

In [None]:
# predict
df["cluster"] = kmeans.predict(df[numcols])

In [None]:
df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,cluster
0,-0.900681,1.019004,-1.340227,-1.315444,setosa,1
1,-1.143017,-0.131979,-1.340227,-1.315444,setosa,6
2,-1.385353,0.328414,-1.397064,-1.315444,setosa,6
3,-1.506521,0.098217,-1.283389,-1.315444,setosa,6
4,-1.021849,1.249201,-1.340227,-1.315444,setosa,1
...,...,...,...,...,...,...
145,1.038005,-0.131979,0.819596,1.448832,virginica,0
146,0.553333,-1.282963,0.705921,0.922303,virginica,7
147,0.795669,-0.131979,0.819596,1.053935,virginica,0
148,0.432165,0.788808,0.933271,1.448832,virginica,0


In [None]:
import altair as alt

In [None]:
numcols

['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

In [None]:
alt.Chart(df).mark_circle().encode(
    x = "sepal_length",
    y = "petal_length",
    color = "cluster:N"
)

Usually we will specify the number of clusters.  If you don't, `KMeans` uses the default value `n_clusters=8`, which is why the cluster labels above range from 0 to 7.
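Here is a small sketch of the default on made-up random points (the array `points` and the estimator names are hypothetical).  Setting `n_init=10` and `random_state=0` is an assumption made only to keep the sketch reproducible and consistent across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: 60 random points in the plane.
rng = np.random.default_rng(0)
points = rng.normal(size=(60, 2))

# No n_clusters given, so the default value is used.
km_default = KMeans(n_init=10, random_state=0)
km_default.fit(points)

# Explicitly request 3 clusters.
km_three = KMeans(n_clusters=3, n_init=10, random_state=0)
km_three.fit(points)

print(km_default.n_clusters)
print(km_default.cluster_centers_.shape)  # one center per cluster
print(km_three.cluster_centers_.shape)
```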

In [None]:
help(KMeans)

Help on class KMeans in module sklearn.cluster._kmeans:

class KMeans(sklearn.base.TransformerMixin, sklearn.base.ClusterMixin, sklearn.base.BaseEstimator)
 |  KMeans(n_clusters=8, *, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True, algorithm='auto')
 |  
 |  K-Means clustering.
 |  
 |  Read more in the :ref:`User Guide <k_means>`.
 |  
 |  Parameters
 |  ----------
 |  
 |  n_clusters : int, default=8
 |      The number of clusters to form as well as the number of
 |      centroids to generate.
 |  
 |  init : {'k-means++', 'random'}, callable or array-like of shape             (n_clusters, n_features), default='k-means++'
 |      Method for initialization:
 |  
 |      'k-means++' : selects initial cluster centers for k-mean
 |      clustering in a smart way to speed up convergence. See section
 |      Notes in k_init for more details.
 |  
 |      'random': choose `n_clusters` observations (rows) at random from data
 |      for the i

Here we create a new `KMeans` object, and specify that it should use 2 clusters.

In [None]:
kmeans2 = KMeans(n_clusters=2)

In [None]:
kmeans2.fit(df[numcols])

KMeans(n_clusters=2)

Here we save it as a new column.  (In lecture, I called this "pred2" instead of "cluster2".)

In [None]:
df["cluster2"] = kmeans2.predict(df[numcols])

In [None]:
df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,cluster,cluster2
0,-0.900681,1.019004,-1.340227,-1.315444,setosa,1,1
1,-1.143017,-0.131979,-1.340227,-1.315444,setosa,6,1
2,-1.385353,0.328414,-1.397064,-1.315444,setosa,6,1
3,-1.506521,0.098217,-1.283389,-1.315444,setosa,6,1
4,-1.021849,1.249201,-1.340227,-1.315444,setosa,1,1
...,...,...,...,...,...,...,...
145,1.038005,-0.131979,0.819596,1.448832,virginica,0,0
146,0.553333,-1.282963,0.705921,0.922303,virginica,7,0
147,0.795669,-0.131979,0.819596,1.053935,virginica,0,0
148,0.432165,0.788808,0.933271,1.448832,virginica,0,0


In [None]:
alt.Chart(df).mark_circle().encode(
    x = "sepal_length",
    y = "petal_length",
    color = "cluster2:N"
)

In [None]:
numcols

['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

In [None]:
chart_list = []

for c in numcols:
    chart = alt.Chart(df).mark_circle().encode(
        x = "sepal_length",
        y = c,
        color = "cluster2:N"
    )
    chart_list.append(chart)

An example of list unpacking.  The `*` in `*chart_list` passes each chart in the list to `alt.vconcat` as a separate argument.

In [None]:
alt.vconcat(*chart_list)
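The same unpacking syntax works with any function that accepts a variable number of positional arguments.  Here is a minimal sketch in plain Python, where `stack` is a hypothetical stand-in for a function like `alt.vconcat`.

```python
def stack(*items):
    # Hypothetical stand-in for a function that, like alt.vconcat,
    # accepts any number of positional arguments.
    return " / ".join(items)

parts = ["first", "second", "third"]

# stack(*parts) is equivalent to stack("first", "second", "third").
unpacked = stack(*parts)
explicit = stack("first", "second", "third")

print(unpacked)
print(unpacked == explicit)
```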