Worksheet 10
Authors (3 maximum; use your full names): BLANK
This worksheet is due Tuesday of Week 7, before discussion section. You are encouraged to work in groups of up to 3 total students, but each student should submit their own file. (It’s fine for everyone in the group to upload the same file.)
Recommendation. Follow the Worksheet 0 instructions to form a free Pro workspace for you and your groupmates, so you can all work on the same file.
In this worksheet, we will see a demonstration of the K-means clustering algorithm.
import numpy as np
import pandas as pd
import altair as alt
Demonstration of the K-means algorithm
Choose two integers, true_cluster_num, which will represent the actual number of clusters for the random data, and guess_cluster_num, the number of clusters we will look for using K-means clustering. Also temporarily define max_steps=3.
Here is an example of making a 3x3 NumPy array of uniformly distributed random numbers between 0 and 1. Adapt the code so that instead, it makes a NumPy array with guess_cluster_num rows and two columns, of uniformly distributed random real numbers between -10 and 10.
rng = np.random.default_rng(seed=4)
starting_points = rng.random(size=(3,3))
Hints:
There is no way inside of rng.random to specify the range of -10 to 10. Instead, first multiply the array by an appropriate number and then subtract an appropriate number from the result.
Do not write the numerical value of guess_cluster_num anywhere. Instead write guess_cluster_num, so that it’s easier to change later.
Keep these two lines in the same cell, so that a new default_rng object is produced each time the cell is run. Also keep the seed=4 portion unchanged for now.
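The rescaling idea in the first hint can be sketched on a different interval. This is only an illustration: the interval [-1, 3), the array shape, and the seed 0 are made-up values, not the ones this worksheet asks for. A uniform sample u in [0, 1) is stretched to the width of the target interval and then shifted so that it starts at the lower endpoint.

```python
import numpy as np

# Hypothetical interval, chosen for illustration only (not the worksheet's values).
low, high = -1.0, 3.0

rng = np.random.default_rng(seed=0)
u = rng.random(size=(4, 2))       # uniform samples in [0, 1)

# Stretch to width (high - low), then shift so the interval starts at low.
# Here this is the same as 4*u - 1: multiply first, then subtract.
scaled = (high - low) * u + low

print(scaled.shape)
print(scaled.min() >= low, scaled.max() < high)
```

The same two-step pattern (multiply, then subtract) works for any interval that is symmetric about 0.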
Produce clustered data using the following code. The make_blobs function needs to be imported from sklearn.datasets. We want these points to lie in the xy-plane, so what should n_features be? Another way to think about it is that we want n_features to match the length of each point from starting_points.
X, _ = make_blobs(
    n_samples=400,
    centers=true_cluster_num,
    n_features=???,
    random_state=1
)
Convert X into a pandas DataFrame df with appropriate column names, so that the following code displays the data. (Reality check: does it look like there are about 400 points, in roughly true_cluster_num clusters?)
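As a reminder of how a two-column NumPy array becomes a DataFrame with named columns, here is a small sketch on a toy array. The array values and the name df_toy are made up for illustration; they are not part of the worksheet data.

```python
import numpy as np
import pandas as pd

# Toy array: 3 points in the plane (illustrative values only).
points = np.array([[0.0, 1.0],
                   [2.5, -1.0],
                   [4.0, 3.0]])

# Each column of the array becomes a named DataFrame column.
df_toy = pd.DataFrame(points, columns=["x", "y"])

print(df_toy.shape)           # (3, 2)
print(list(df_toy.columns))   # ['x', 'y']
```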
alt.Chart(df).mark_circle().encode(
    x="x",
    y="y"
)
Instantiate a new KMeans object from scikit-learn, specifying n_clusters = ???, where ??? is either true_cluster_num or guess_cluster_num. Which should it be? Also use the following keyword arguments when you instantiate the KMeans object:
max_iter=1, init=starting_points, n_init=1
The max_iter=1 says to only run one iteration of the K-means clustering algorithm. The init=starting_points defines our initial centroid guesses (usually these are chosen randomly by scikit-learn). Typically the entire algorithm is repeated multiple times; the n_init=1 says the entire algorithm should only be run once.
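To make "one iteration" concrete, here is a hand-written sketch of a single K-means step in plain NumPy: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. This is not scikit-learn's implementation, and the four points and two starting centroids are made-up toy values. The sketch also assumes every centroid receives at least one point.

```python
import numpy as np

# Toy data and toy starting centroids (illustrative values only).
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centroids = np.array([[1.0, 1.0], [4.0, 1.0]])

# Assignment step: distance from every point to every centroid, shape (4, 2).
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)     # index of the nearest centroid for each point

# Update step: move each centroid to the mean of the points assigned to it.
# (Assumes no centroid ends up with zero points.)
new_centroids = np.array(
    [points[labels == k].mean(axis=0) for k in range(len(centroids))]
)

print(labels)          # [0 0 1 1]
print(new_centroids)   # rows [0, 1] and [10, 1]
```

Repeating these two steps, with new_centroids fed back in as centroids, gives further iterations.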
Fit this KMeans object according to the data in df. Then predict the clusters. Add a new column to df corresponding to the predicted clusters.
Check your answer: Color the points according to the predicted cluster values. The data will probably not be very well-clustered, because only one iteration of the K-means clustering algorithm was used.
alt.Chart(df).mark_circle().encode(
    x="x",
    y="y",
    color="???:???"
)
The above chart showed the clusters after a single iteration of the K-means clustering algorithm. We want to put data for many different numbers of iterations into a single pandas DataFrame. Adapt the following code.
df_list = []
for i in range(1, max_steps+1):
    df_temp = df[["x", "y"]].copy()
    kmeans = KMeans(n_clusters = guess_cluster_num, max_iter=i, init=starting_points, n_init = 1)
    ??? # Fit kmeans to the data in X. Use one or more lines.
    df_temp["cluster"] = ??? # the cluster values predicted by `kmeans`
    df_temp["num_steps"] = ??? # How many iterations of K-means were run?
    df_list.append(df_temp)
df_clusters = pd.concat(df_list, axis=???) # Should this be 0 or 1?
step_slider = alt.binding_range(min=1, max=max_steps, step=1)
step_select = alt.selection_single(
    fields=['num_steps'],
    bind=step_slider,
    init={???: 1}, # Start the slider at 1
    name="slider"
)
c1 = alt.Chart(df_clusters).mark_circle().encode(
    x = "x",
    y = "y",
    color = ???
).transform_filter(
    ???
).add_selection(
    ???
)
)
c1
Evaluate df_clusters.loc[2]. Why are there three rows? What do the values in the “cluster” column represent? Can you recognize the change in cluster using the slider? (It’s possible your numbers might look different. If the cluster number is always the same, try df_clusters.loc[3] or any other numeric value.)
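The behavior of .loc on a stacked DataFrame can be seen on a toy example. This sketch uses made-up names (base, frames, stacked) and made-up values; it only illustrates that stacking DataFrames on top of each other keeps each piece's original index labels, so one label can match several rows.

```python
import pandas as pd

base = pd.DataFrame({"x": [1.0, 2.0, 3.0]})    # index labels 0, 1, 2

frames = []
for step in range(1, 4):
    tmp = base.copy()
    tmp["step"] = step
    frames.append(tmp)

# Stack the three copies vertically; the index labels 0, 1, 2 repeat.
stacked = pd.concat(frames, axis=0)

print(stacked.shape)     # (9, 2)
print(stacked.loc[2])    # three rows, one from each copy

# With ignore_index=True, pandas assigns fresh labels 0 through 8 instead.
relabeled = pd.concat(frames, axis=0, ignore_index=True)
print(list(relabeled.index))
```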
We also want to include the data from before any clustering is done. Put a new copy of df[["x", "y"]] at the top of df_clusters, together with a “cluster” column of all 0s (corresponding to no clustering) and a num_steps column that is also all 0s.
Check your answer, part 1. The new df_clusters should have 1600 rows and 4 columns.
Check your answer, part 2. Paste the Altair code, starting at step_slider, from above into the following cell. Change the slider minimum value to 0 and change the selection initial value to 0. If you run the code, you should start out seeing points all of the same color. When you drag the slider, colors should show up.
We also want to include the current cluster centers. After the KMeans object has been fit, it will have an attribute cluster_centers_ which contains this data. (Sample quiz question: what is the shape of this kmeans.cluster_centers_ NumPy array?) Adapt the following code to store these cluster centers in a pandas DataFrame named df_centers.
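Here is a small sketch of reading cluster_centers_ from a fitted KMeans object. The data is a made-up toy array with 3 features (not the 2-feature data from this worksheet), and the names data, init, and km are chosen only for this illustration. The fitted array has one row per cluster and one column per feature.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: 6 points with 3 features each, in two well-separated groups.
data = np.array([
    [0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0],
    [9.0, 9.0, 9.0], [9.0, 10.0, 9.0], [10.0, 9.0, 9.0],
])

# Explicit starting centroids, one row per cluster (illustrative values).
init = np.array([[0.0, 0.0, 0.0], [9.0, 9.0, 9.0]])

km = KMeans(n_clusters=2, init=init, n_init=1, max_iter=10)
km.fit(data)

# The attribute only exists after fitting.
print(km.cluster_centers_.shape)   # (2, 3)
print(km.cluster_centers_)
```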
center_list = []
for i in range(0, max_steps+1):
    kmeans = KMeans(n_clusters = guess_cluster_num, max_iter=i+1, init=starting_points, n_init = 1)
    ??? # Fit the KMeans object to the data in X.
    df_temp2 = pd.DataFrame(???, columns=["x","y"]) # Put the cluster centers in here
    df_temp2["num_steps"] = i
    center_list.append(???)
df_centers = pd.concat(center_list, axis=???)
c2 = alt.Chart(df_centers).mark_point(size=300, filled=True, opacity=1).encode(
    x = "x",
    y = "y",
    color = alt.value("black"),
    shape = alt.value("cross"),
).transform_filter(
    ???
).properties(
    width=400,
    height=500
)
c1+c2
It would be better if the centers and the cluster colors moved in different stages.
Make a copy of df_clusters, called df_clusters2, and subtract 0.5 from the num_steps column of df_clusters2. (Hint. Be sure you use the copy method.)
Concatenate df_clusters and df_clusters2, one on top of the other, and name the result df_clusters_both.
Make a copy of df_centers, called df_centers2, and add 0.5 to the num_steps column of df_centers2. Again, be sure to use the copy method.
Concatenate df_centers and df_centers2, one on top of the other, and name the result df_centers_both.
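The reason for the copy method can be seen on a toy DataFrame. This is a sketch with made-up names (df_a, df_b, both) and made-up values: the shifted copy is changed, the original is left alone, and the stacked result contains both the whole and the half steps.

```python
import pandas as pd

df_a = pd.DataFrame({"value": [10, 20], "num_steps": [1, 2]})

# copy() makes an independent DataFrame; plain assignment (df_b = df_a)
# would just give a second name for the same object.
df_b = df_a.copy()
df_b["num_steps"] = df_b["num_steps"] - 0.5

# Stack the original on top of the shifted copy.
both = pd.concat([df_a, df_b], axis=0)

print(df_a["num_steps"].tolist())                    # [1, 2]
print(sorted(both["num_steps"].unique().tolist()))   # [0.5, 1.0, 1.5, 2.0]
```

A slider with step 0.5 can then land on every value in the combined num_steps column.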
Adapt the slider and chart code from above (both c1 and c2, starting at step_slider) and paste it below. Make the following changes.
Change the step for the slider from 1 to 0.5.
Change the DataFrames used in both charts to df_clusters_both and df_centers_both.
Your code should look like the following. (You should not include the code making df_clusters or df_centers.)
step_slider = ...
step_select = ...
c1 = ...
c2 = ...
c1+c2
If everything went right, you should see the clusters and the centers move in different stages. Take a minute to look at the demonstration and make sure you understand what is happening.
Try changing some values above, until you get a chart that you think is interesting.
Try different values of true_cluster_num, guess_cluster_num, n_samples, max_steps. If your df_clusters_both has more than 5000 rows, you can use alt.data_transformers.enable('default', max_rows=???) to allow more rows for Altair. For this particular DataFrame, anything up to about 50,000 rows should be fine.
Try different values of the seed for the NumPy random number generator and of the random_state for the make_blobs function. You can also make the blobs more or less spread out by including a cluster_std keyword argument in the make_blobs function. (The bigger the cluster_std, the more spread out the clusters will be.)
You can try changing the colors of the clusters by setting a different color scheme. You can also try different shapes, sizes, and colors for the center point markings. I believe this is the list of possible plotting shapes.
In the following markdown cell, briefly explain what you like/think is interesting about the chart you chose. (Just 1-2 sentences is fine.)
Reminder
Every group member needs to submit this on Canvas (even if you all submit the same link).
Be sure you’ve included the (full) names of you and your group members at the top after “Authors”. You might consider restarting this notebook (the Deepnote refresh icon, not your browser refresh icon) and running it again to make sure it’s working.
Submission
Using the Share & publish menu at the top right, enable public sharing in the “Share project” section, and enable Comment privileges. Then submit that link on Canvas.