Week 6, Tuesday Discussion

Reminders:

  • Homework #4 due tonight at 11:59pm

  • Quiz #3 today during last 25 minutes of discussion

Today:

  • Review for Quiz #3 (first 25 minutes of discussion)

  • Quiz #3 (remainder of discussion)

Question 1: Using the following dataset, run the KMeans algorithm by hand, starting with the initial points \((3,0)\) and \((1,0)\). (How many clusters do you expect with this setup?)

Warning: Notice the scaling on the \(x\)- and \(y\)-axes!

import pandas as pd
import altair as alt
from sklearn.cluster import KMeans

# The six data points. Note that y ranges from 0 to 10 while x only ranges from 0 to 4.
df = pd.DataFrame({"x":[0,0,1,1,3,4],"y":[0,10,0,10,0,0]})
datapoints = alt.Chart(df).mark_circle(size = 100, color = "black").encode(
    x = "x",
    y = "y"
)

First Iteration of KMeans:

Let Cluster 1 be the points closest to \((3,0)\) and Cluster 2 the points closest to \((1,0)\). Then, our first clusters are as follows:

Cluster 1: \((3,0), (4,0)\)

Cluster 2: \((1,0),(1,10),(0,0),(0,10)\)

Let us now compute the centroid of each cluster by taking the component-wise average of the points in that cluster.

We see that Cluster 1 has average \((3.5,0)\) and Cluster 2 has average \((0.5, 5)\). These two points become our new centers.
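The assignment and averaging steps above can be checked numerically. The following is a minimal sketch using NumPy (not part of the original worksheet); the names `points`, `centers`, `labels`, and `new_centers` are chosen here for illustration, and label 0 stands for Cluster 1 while label 1 stands for Cluster 2.

```python
import numpy as np

# The six data points and the two starting centers from Question 1.
points = np.array([[0, 0], [0, 10], [1, 0], [1, 10], [3, 0], [4, 0]], dtype=float)
centers = np.array([[3, 0], [1, 0]], dtype=float)

# distances[i, j] is the Euclidean distance from point i to center j.
distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)

# Assign each point to its nearest center (0 -> Cluster 1, 1 -> Cluster 2).
labels = distances.argmin(axis=1)

# The new center of each cluster is the component-wise mean of its points.
new_centers = np.array([points[labels == k].mean(axis=0) for k in range(len(centers))])

print(labels)
print(new_centers)
```

Because the distance uses both coordinates at their raw scale, the points with \(y = 10\) are far from both starting centers, which is why the warning about axis scaling matters.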

round1 = pd.DataFrame({"x":[3.5,0.5],"y":[0,5]})
average1 = alt.Chart(round1).mark_point(size = 100, color = "green").encode(
    x = "x",
    y = "y"
)

datapoints+average1

Remark: Notice that the new centers are the centers of mass of the two original clusters.

Second Iteration of KMeans:

Let Cluster 1 be the set of points closest to \((3.5,0)\) and Cluster 2 those closest to \((0.5,5)\). Then,

Cluster 1: \((3,0),(4,0),(1,0),(0,0)\)

Cluster 2: \((0,10),(1,10)\)

We compute the centroids again. For Cluster 1 the centroid is \((2,0)\), and for Cluster 2 it is \((0.5,10)\).

round2 = pd.DataFrame({"x":[0.5,2],"y":[10,0]})
average2 = alt.Chart(round2).mark_point(size = 100, color = "orange").encode(
    x = "x",
    y = "y"
)

datapoints+average1+average2

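Notice! If we repeated the process once more, the clusters would not change: every point with \(y = 0\) is still closest to \((2,0)\), and both points with \(y = 10\) are still closest to \((0.5,10)\). This is how we know to terminate the process. We conclude that the clusters we found in the second iteration are the final result.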
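The whole by-hand procedure, including the stopping rule, can be sketched as a short loop. This is an illustrative implementation under stated assumptions, not scikit-learn's: the function name `kmeans_by_hand` and the `max_iter` parameter are hypothetical, ties go to the lower-numbered center, and no cluster is assumed to become empty.

```python
import numpy as np

def kmeans_by_hand(points, centers, max_iter=100):
    """Alternate assignment and averaging until the assignments stop changing."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    labels = None
    for n_passes in range(1, max_iter + 1):
        # Assignment step: nearest center by Euclidean distance.
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Stop as soon as an assignment pass changes nothing.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: each center moves to the mean of its assigned points
        # (assumes every cluster has at least one point).
        centers = np.array([points[labels == k].mean(axis=0) for k in range(len(centers))])
    return centers, labels, n_passes

data = [[0, 0], [0, 10], [1, 0], [1, 10], [3, 0], [4, 0]]
final_centers, final_labels, n_passes = kmeans_by_hand(data, [[3, 0], [1, 0]])
print(final_centers, final_labels, n_passes)
```

Here `n_passes` counts assignment passes, so it includes the final pass that only confirms nothing changed.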

Let’s see how well we did by comparing with scikit-learn’s KMeans, which chooses its own starting centers.

kmeans = KMeans(n_clusters=2)

kmeans.fit(df)
df["Clusters"] = kmeans.predict(df)

datapoints.encode(
    color = "Clusters:N"
)
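For a like-for-like comparison with the hand computation, scikit-learn can be given the same two starting points through the `init` argument of `KMeans`. The following is a sketch rather than part of the original worksheet; the names `initial_centers` and `kmeans_seeded` are illustrative, and `n_init=1` is set because only one explicit initialization is supplied.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({"x": [0, 0, 1, 1, 3, 4], "y": [0, 10, 0, 10, 0, 0]})

# Assumption: seed the algorithm with the starting points used by hand.
initial_centers = np.array([[3.0, 0.0], [1.0, 0.0]])
kmeans_seeded = KMeans(n_clusters=2, init=initial_centers, n_init=1)
kmeans_seeded.fit(df[["x", "y"]])

seeded_centers = kmeans_seeded.cluster_centers_  # final centers, one row per cluster
seeded_labels = kmeans_seeded.labels_            # cluster index assigned to each row of df
print(seeded_centers)
print(seeded_labels)
```

With the default random initialization, the numbering of the clusters (which one is called 0 and which is 1) is arbitrary and may differ between runs.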

Question 2:

Consider df2 as below (df with the Clusters column dropped). What will be the result of each of the following?

df2.apply(lambda f: f["y"] + 2**f["x"], axis = 1)

df2.apply(lambda g: g[0] + g[5], axis = 0)

df2 = df.drop("Clusters", axis = 1).copy()
df2

   x   y
0  0   0
1  0  10
2  1   0
3  1  10
4  3   0
5  4   0

Before we answer, let’s take a look at the documentation for apply. In particular, pay attention to the axis argument.

df2.apply(lambda f: f["y"] + 2**f["x"], axis = 1)

0     1
1    11
2     2
3    12
4     8
5    16
dtype: int64

df2.apply(lambda g: g[0] + g[5], axis = 0)

x    4
y    0
dtype: int64

type(df2.apply(lambda g: g[0] + g[5], axis = 0))

pandas.core.series.Series
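To make the role of the axis argument concrete, here is a small sketch (not part of the original worksheet) that rebuilds df2 and compares the two directions. With axis = 1 the function receives one row at a time, indexed by column name; with axis = 0 it receives one column at a time, indexed by row label. The variable names `per_row`, `per_col`, and `vectorized` are illustrative.

```python
import pandas as pd

df2 = pd.DataFrame({"x": [0, 0, 1, 1, 3, 4], "y": [0, 10, 0, 10, 0, 0]})

# axis=1: the lambda sees each row, so the result has one entry per row.
per_row = df2.apply(lambda row: row["x"] + row["y"], axis=1)

# axis=0: the lambda sees each column, so the result has one entry per column.
per_col = df2.apply(lambda col: col.sum(), axis=0)

# The first expression in Question 2 can also be written without apply,
# using column arithmetic directly.
vectorized = df2["y"] + 2 ** df2["x"]

print(per_row)
print(per_col)
print(vectorized)
```

The result is a Series in both cases; only its index differs (row labels for axis = 1, column names for axis = 0).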