Worksheet 10
Authors (3 maximum; use your full names): BLANK
This worksheet is due Tuesday of Week 7, before discussion section. You are encouraged to work in groups of up to 3 total students, but each student should submit their own file. (It’s fine for everyone in the group to upload the same file.)
Recommendation. Follow the Worksheet 0 instructions to form a free Pro workspace for you and your groupmates, so you can all work on the same file.
In this worksheet, we will see a demonstration of the K-means clustering algorithm.
import numpy as np
import pandas as pd
import altair as alt
Demonstration of the K-means algorithm
Choose two integers: true_cluster_num, which will represent the actual number of clusters for the random data, and guess_cluster_num, the number of clusters we will look for using K-means clustering. Also temporarily define max_steps=3.
Here is an example of making a 3x3 NumPy array of uniformly distributed random numbers between 0 and 1. Adapt the code so that instead, it makes a NumPy array with guess_cluster_num rows and two columns, of uniformly distributed random real numbers between -10 and 10.
rng = np.random.default_rng(seed=4)
starting_points = rng.random(size=(3,3))
Hints:
There is no way inside of rng.random to specify the range of -10 to 10. Instead, first multiply the array by an appropriate number and then subtract an appropriate number from it.
Do not write the numerical value of guess_cluster_num anywhere. Instead write guess_cluster_num so that it’s easier to change later.
Keep these two lines in the same cell, so that a new default_rng object is produced each time the cell is run. Also keep the seed=4 portion unchanged for now.
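As a minimal sketch of the rescaling idea in the first hint, here is the same trick applied to a different interval, so the exercise itself is still yours to do. The values from rng.random lie in [0, 1), so multiplying by the width of the target interval and then shifting by its lower endpoint moves them into that interval. The names low, high, and demo are illustrative and not part of the worksheet.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Illustrative target interval (not the worksheet's): [2, 6)
low, high = 2, 6

# rng.random gives values in [0, 1); scale by the width, then shift by low
demo = (high - low) * rng.random(size=(5, 2)) + low

print(demo.shape)  # (5, 2)
print(demo.min(), demo.max())  # both lie between 2 and 6
```

When the lower endpoint is negative, adding low is the same as subtracting a positive number, which matches the "multiply, then subtract" wording of the hint.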
Produce clustered data using the following code. The make_blobs function needs to be imported from sklearn.datasets. We want these points to lie in the xy-plane, so what should n_features be? Another way to think about it is that we want n_features to match the length of each point from starting_points.
X, _ = make_blobs(
    n_samples=400,
    centers=true_cluster_num,
    n_features=???,
    random_state=1
)
Convert X into a pandas DataFrame df with appropriate column names, so that the following code displays the data. (Reality check: does it look like there are about 400 points, in roughly true_cluster_num clusters?)
alt.Chart(df).mark_circle().encode(
    x="x",
    y="y"
)
Instantiate a new KMeans object from scikit-learn, specifying n_clusters = ???, where ??? is either true_cluster_num or guess_cluster_num. Which should it be?
Also use the following keyword arguments when you instantiate the KMeans object:
max_iter=1, init=starting_points, n_init=1
The max_iter=1 says to only run one iteration of the K-means clustering algorithm. The init=starting_points defines our initial centroid guesses (usually these are chosen randomly by scikit-learn). Typically the entire algorithm is repeated multiple times; the n_init=1 says the entire algorithm should only be run once.
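To make "one iteration" concrete, here is a minimal NumPy sketch of a single K-means step: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. This is an illustration of the idea only, not scikit-learn's implementation (for example, it does not handle a centroid that ends up with no points). The arrays points and centroids are made-up values, with centroids playing the role of init.

```python
import numpy as np

# Tiny illustrative data set: two obvious groups in the plane
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])

# Hypothetical initial centroid guesses (the role played by init=...)
centroids = np.array([[1.0, 1.0], [9.0, 9.0]])

# Assignment step: squared distance from every point to every centroid,
# then the index of the nearest centroid for each point
diffs = points[:, np.newaxis, :] - centroids[np.newaxis, :, :]
sq_dists = (diffs ** 2).sum(axis=2)
labels = sq_dists.argmin(axis=1)

# Update step: move each centroid to the mean of the points assigned to it
new_centroids = np.array(
    [points[labels == k].mean(axis=0) for k in range(len(centroids))]
)

print(labels)         # [0 0 1 1]
print(new_centroids)  # rows are [0, 0.5] and [10, 10.5]
```

Repeating these two steps with new_centroids in place of centroids corresponds to running more iterations.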
Fit this KMeans object according to the data in df. Then predict the clusters. Add a new column to df corresponding to the predicted clusters.
Check your answer: Color the points according to the predicted cluster values. The data will probably not be very well-clustered, because only one iteration of the K-means clustering algorithm was used.
alt.Chart(df).mark_circle().encode(
    x="x",
    y="y",
    color="???:???"
)
The above chart showed the clusters after a single iteration of the K-means clustering algorithm. We want to put data for many different numbers of iterations into a single pandas DataFrame. Adapt the following code.
df_list = []
for i in range(1, max_steps+1):
    df_temp = df[["x", "y"]].copy()
    kmeans = KMeans(n_clusters = guess_cluster_num, max_iter=i, init=starting_points, n_init = 1)
    ??? # Fit kmeans to the data in X. Use one or more lines.
    df_temp["cluster"] = ??? # the cluster values predicted by `kmeans`
    df_temp["num_steps"] = ??? # How many iterations of K-means were run?
    df_list.append(df_temp)
df_clusters = pd.concat(df_list, axis=???) # Should this be 0 or 1?
step_slider = alt.binding_range(min=1, max=max_steps, step=1)
step_select = alt.selection_single(
    fields=['num_steps'],
    bind=step_slider,
    init={???: 1}, # Start the slider at 1
    name="slider"
)
c1 = alt.Chart(df_clusters).mark_circle().encode(
    x = "x",
    y = "y",
    color = ???
).transform_filter(
    ???
).add_selection(
    ???
)
c1
Evaluate df_clusters.loc[2]. Why are there three rows? What do the values in the “cluster” column represent? Can you recognize the change in cluster using the slider? (It’s possible your numbers might look different. If the cluster number is always the same, try df_clusters.loc[3] or any other numeric value.)
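If the behavior of pd.concat is unfamiliar, the following small sketch may help with both the axis question and the loc question above. It uses two made-up DataFrames, a and b, that are unrelated to the worksheet data, and compares the two possible axis values.

```python
import pandas as pd

# Two small illustrative DataFrames that share the same index labels 0, 1
a = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
b = pd.DataFrame({"x": [5, 6], "y": [7, 8]})

stacked = pd.concat([a, b], axis=0)       # one on top of the other
side_by_side = pd.concat([a, b], axis=1)  # one next to the other

print(stacked.shape)       # (4, 2)
print(side_by_side.shape)  # (2, 4)

# The original index labels are kept, so a label can occur more than once
print(stacked.loc[0])      # two rows, one from a and one from b
```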
We also want to include the data from before any clustering is done. Put a new copy of df[["x", "y"]] at the top of df_clusters, together with a “cluster” column of all 0s (corresponding to no clustering) and a num_steps column that is also all 0s.
Check your answer, part 1. The new df_clusters should have 1600 rows and 4 columns.
Check your answer, part 2. Paste the Altair code, starting at step_slider, from above into the following cell. Change the slider minimum value to 0 and change the selection initial value to 0. If you run the code, you should start out seeing points all of the same color. When you drag the slider, colors should show up.
We also want to include the current cluster centers. After the KMeans object has been fit, it will have an attribute cluster_centers_ which contains this data. (Sample quiz question: what is the shape of this kmeans.cluster_centers_ NumPy array?) Adapt the following code to store these cluster centers in a pandas DataFrame named df_centers.
center_list = []
for i in range(0, max_steps+1):
    kmeans = KMeans(n_clusters = guess_cluster_num, max_iter=i+1, init=starting_points, n_init = 1)
    ??? # Fit the KMeans object to the data in X.
    df_temp2 = pd.DataFrame(???, columns=["x","y"]) # Put the cluster centers in here
    df_temp2["num_steps"] = i
    center_list.append(???)
df_centers = pd.concat(center_list, axis=???)
c2 = alt.Chart(df_centers).mark_point(size=300, filled=True, opacity=1).encode(
    x = "x",
    y = "y",
    color = alt.value("black"),
    shape = alt.value("cross"),
).transform_filter(
    ???
).properties(
    width=400,
    height=500
)
c1+c2
It would be better if the centers and the cluster colors moved in different stages.
Make a copy of df_clusters, called df_clusters2, and subtract 0.5 from the num_steps column of df_clusters2. (Hint. Be sure you use the copy method.)
Concatenate df_clusters and df_clusters2, one on top of the other, and name the result df_clusters_both.
Make a copy of df_centers, called df_centers2, and add 0.5 to the num_steps column of df_centers2. Again, be sure to use the copy method.
Concatenate df_centers and df_centers2, one on top of the other, and name the result df_centers_both.
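The reason for the copy hints can be seen in a small sketch. Assigning a DataFrame to a new name does not make a new DataFrame, so changes made through the new name also show up in the original. The DataFrame original and the names alias and independent are made up for this illustration.

```python
import pandas as pd

original = pd.DataFrame({"num_steps": [0, 1, 2]})

alias = original                # same object under a second name
independent = original.copy()   # a separate DataFrame with the same values

# Shifting the column through the alias also changes `original`
alias["num_steps"] = alias["num_steps"] + 0.5

print(alias is original)                  # True
print(original["num_steps"].tolist())     # [0.5, 1.5, 2.5]
print(independent["num_steps"].tolist())  # [0, 1, 2]
```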
Adapt the slider and chart code from above (both c1 and c2, starting at step_slider) and paste it below. Make the following changes.
Change the step for the slider from 1 to 0.5.
Change the DataFrames used in both charts to df_clusters_both and df_centers_both.
Your code should look like the following. (You should not include the code making df_clusters or df_centers.)
step_slider = ...
step_select = ...
c1 = ...
c2 = ...
c1+c2
If everything went right, you should see the clusters and the centers move in different stages. Take a minute to look at the demonstration and make sure you understand what is happening.
Try changing some values above, until you get a chart that you think is interesting.
Try different values of true_cluster_num, guess_cluster_num, n_samples, max_steps. If your df_clusters_both has more than 5000 rows, you can use alt.data_transformers.enable('default', max_rows=???) to allow more rows for Altair. For this particular DataFrame, anything up to about 50,000 rows should be fine.
Try different values of the seed for the NumPy random number generator and of the random_state for the make_blobs function. You can also make the blobs more or less spread out, by including a cluster_std keyword argument in the make_blobs function. (The bigger the cluster_std, the more spread out the clusters will be.)
You can try changing the colors of the clusters, by setting a different color scheme. You can also try different shapes, sizes, and colors for the center point markings. I believe this is the list of possible plotting shapes.
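To see the effect of cluster_std numerically before plotting, here is a small sketch that generates one blob around the origin twice, with a small and a large cluster_std, and compares the spread. The specific values (200 samples, a single center at the origin, 0.5 and 3.0) are arbitrary choices for illustration and not values the worksheet asks for.

```python
from sklearn.datasets import make_blobs

# One blob centered at the origin, generated with two different spreads
tight, _ = make_blobs(
    n_samples=200, centers=[[0.0, 0.0]], cluster_std=0.5, random_state=1
)
wide, _ = make_blobs(
    n_samples=200, centers=[[0.0, 0.0]], cluster_std=3.0, random_state=1
)

# The standard deviation of the coordinates grows with cluster_std
print(tight.std(), wide.std())
```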
In the following markdown cell, briefly explain what you like/think is interesting about the chart you chose. (Just 1-2 sentences is fine.)
Reminder
Every group member needs to submit this on Canvas (even if you all submit the same link).
Be sure you’ve included the (full) names of you and your group members at the top after “Authors”. You might consider restarting this notebook (the Deepnote refresh icon, not your browser refresh icon) and running it again to make sure it’s working.
Submission
Using the Share & publish menu at the top right, enable public sharing in the “Share project” section, and enable Comment privileges. Then submit that link on Canvas.