Worksheet 10

Authors (3 maximum; use your full names): BLANK

This worksheet is due Tuesday of Week 7, before discussion section. You are encouraged to work in groups of up to 3 total students, but each student should submit their own file. (It’s fine for everyone in the group to upload the same file.)

Recommendation. Follow the Worksheet 0 instructions to form a free Pro workspace for you and your groupmates, so you can all work on the same file.

In this worksheet, we will see a demonstration of the K-means clustering algorithm.

import numpy as np
import pandas as pd
import altair as alt

Demonstration of the K-means algorithm

  • Choose two integers: true_cluster_num, which will represent the actual number of clusters in the random data, and guess_cluster_num, the number of clusters we will look for using K-means clustering. Also temporarily define max_steps=3.

  • Here is an example of making a 3x3 NumPy array of uniformly distributed random numbers between 0 and 1. Adapt the code so that instead, it makes a NumPy array with guess_cluster_num rows and two columns, of uniformly distributed random real numbers between -10 and 10.

rng = np.random.default_rng(seed=4)
starting_points = rng.random(size=(3,3))

Hints:

  1. There is no way inside of rng.random to specify the range of -10 to 10. Instead, first multiply the array by an appropriate number and then subtract an appropriate number from the result.

  2. Do not write the numerical value of guess_cluster_num anywhere. Instead write guess_cluster_num so that it’s easier to change later.

  3. Keep these two lines in the same cell, so that a new default_rng object is produced each time the cell is run. Also keep the seed=4 portion unchanged for now.
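
The rescaling idea from the first hint can be sketched on a different interval. This is only an illustration: the helper name uniform_between, the variable n_rows, and the interval from -3 to 7 are all hypothetical and not part of the worksheet. The same pattern should carry over to the range the worksheet asks for.

```python
import numpy as np

def uniform_between(rng, low, high, size):
    """Rescale uniform samples from [0, 1) to [low, high) (hypothetical helper)."""
    # Multiply by the width of the interval, then shift by the lower endpoint.
    # When low is negative, adding low is the same as subtracting its absolute value.
    return (high - low) * rng.random(size=size) + low

rng = np.random.default_rng(seed=4)
n_rows = 5  # hypothetical stand-in for guess_cluster_num
pts = uniform_between(rng, low=-3, high=7, size=(n_rows, 2))

print(pts.shape)
print(pts.min(), pts.max())
```

Using a variable such as n_rows in the size argument, rather than a literal number, is what makes it easy to change the number of rows later.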

  • Produce clustered data using the following code. The make_blobs function needs to be imported from sklearn.datasets. We want these points to lie in the xy-plane, so what should n_features be? Another way to think about it is that we want n_features to match the length of each point from starting_points.

X, _ = make_blobs(
    n_samples=400,
    centers=true_cluster_num,
    n_features=???,
    random_state=1
)
  • Convert X into a pandas DataFrame df with appropriate column names, so that the following code displays the data. (Reality check: does it look like there are about 400 points, roughly in true_cluster_num number of clusters?)

alt.Chart(df).mark_circle().encode(
    x="x",
    y="y"
)
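
As a rough sketch of the conversion step, here is how a small two-column NumPy array can be turned into a DataFrame with named columns. The array arr and the name df_toy are hypothetical stand-ins for X and df.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the array X returned by make_blobs (three points in the plane).
arr = np.array([[0.5, 1.0], [-2.0, 3.5], [4.0, -1.5]])

# Each column of the array becomes a named column of the DataFrame.
df_toy = pd.DataFrame(arr, columns=["x", "y"])

print(df_toy.shape)
print(list(df_toy.columns))
```

The column names need to match the strings used in the Altair encoding, which is why "x" and "y" are used here.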
  • Instantiate a new KMeans object from scikit-learn, specifying n_clusters = ???, where ??? is either true_cluster_num or guess_cluster_num. Which should it be?

  • Also use the following keyword arguments when you instantiate the KMeans object:

max_iter=1, init=starting_points, n_init=1

The max_iter=1 says to only run one iteration of the K-means clustering algorithm. The init=starting_points defines our initial centroid guesses (usually these are chosen randomly by scikit-learn). Typically the entire algorithm is repeated multiple times; the n_init=1 says the entire algorithm should only be run once.
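
To make "one iteration" more concrete, here is a minimal NumPy sketch of the textbook K-means step: assign each point to its nearest center, then move each center to the mean of its assigned points. This is an illustration under simplifying assumptions, not scikit-learn's actual implementation, and the function name kmeans_one_step and the toy data are hypothetical.

```python
import numpy as np

def kmeans_one_step(points, centers):
    """One assign-then-update pass of the textbook K-means iteration."""
    # Assignment step: distances from every point to every center,
    # stored in an array of shape (n_points, n_centers).
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)

    # Update step: move each center to the mean of the points assigned to it.
    new_centers = centers.copy()
    for k in range(len(centers)):
        members = points[labels == k]
        if len(members) > 0:  # an empty cluster keeps its old center in this sketch
            new_centers[k] = members.mean(axis=0)
    return labels, new_centers

points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centers = np.array([[1.0, 1.0], [8.0, 1.0]])  # initial guesses, like starting_points

labels, new_centers = kmeans_one_step(points, centers)
print(labels)
print(new_centers)
```

In this toy example the two left points are assigned to the first center and the two right points to the second, so the centers move to (0, 1) and (10, 1).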

  • Fit this KMeans object according to the data in df. Then predict the clusters. Add a new column to df corresponding to the predicted clusters.

  • Check your answer: Color the points according to the predicted cluster values. The data will probably not be very well-clustered, because only one iteration of the K-means clustering algorithm was used.

alt.Chart(df).mark_circle().encode(
    x="x",
    y="y",
    color="???:???"
)
  • The above chart showed the clusters after a single iteration of the K-means clustering algorithm. We want to put data for many different numbers of iterations into a single pandas DataFrame. Adapt the following code.

df_list = []

for i in range(1, max_steps+1):
    df_temp = df[["x", "y"]].copy()
    kmeans = KMeans(n_clusters = guess_cluster_num, max_iter=i, init=starting_points, n_init = 1)
    ??? # Fit kmeans to the data in X.  Use one or more lines.
    df_temp["cluster"] = ??? # the cluster values predicted by `kmeans`
    df_temp["num_steps"] = ??? # How many iterations of K-means were run?
    df_list.append(df_temp)

df_clusters = pd.concat(df_list, axis=???) # Should this be 0 or 1?

step_slider = alt.binding_range(min=1, max=max_steps, step=1)
step_select = alt.selection_single(
    fields=['num_steps'],
    bind=step_slider,
    init={???: 1}, # Start the slider at 1
    name="slider"
)

c1 = alt.Chart(df_clusters).mark_circle().encode(
    x = "x",
    y = "y",
    color = ???
).transform_filter(
    ???
).add_selection(
    ???
)

c1
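
The loop above collects several DataFrames and stacks them with pd.concat. The following toy sketch shows how row labels behave when copies of the same DataFrame are stacked; the names base, pieces, and stacked are hypothetical and the data is made up.

```python
import pandas as pd

base = pd.DataFrame({"x": [0.0, 1.0, 2.0], "y": [5.0, 6.0, 7.0]})

pieces = []
for step in range(1, 4):
    piece = base.copy()          # copy so that base itself is never modified
    piece["num_steps"] = step    # record which pass this copy belongs to
    pieces.append(piece)

# Stacking rows keeps each copy's original index labels, so labels repeat.
stacked = pd.concat(pieces, axis=0)

print(stacked.shape)
print(stacked.loc[2])
```

Because every copy keeps the labels 0, 1, 2, selecting a single label with loc returns one row per copy.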
  • Evaluate df_clusters.loc[2]. Why are there three rows? What do the values in the “cluster” column represent? Can you recognize the change in cluster using the slider? (It’s possible your numbers might look different. If the cluster number is always the same, try df_clusters.loc[3] or any other integer value.)

  • We also want to include the data from before any clustering is done. Put a new copy of df[["x", "y"]] at the top of df_clusters, together with a “cluster” column of all 0s (corresponding to no clustering) and a num_steps column that is also all 0s.

  • Check your answer, part 1. The new df_clusters should have 1600 rows and 4 columns.

  • Check your answer, part 2. Paste the Altair code, starting at step_slider, from above into the following cell. Change the slider minimum value to 0 and change the selection initial value to 0. If you run the code, you should start out seeing points all of the same color. When you drag the slider, colors should show up.

  • We also want to include the current cluster centers. After the KMeans object has been fit, it will have an attribute cluster_centers_ which contains this data. (Sample quiz question: what is the shape of this kmeans.cluster_centers_ NumPy array?) Adapt the following code to store these cluster centers in a pandas DataFrame named df_centers.

center_list = []

for i in range(0, max_steps+1):
    kmeans = KMeans(n_clusters = guess_cluster_num, max_iter=i+1, init=starting_points, n_init = 1)
    ??? # Fit the KMeans object to the data in X.
    df_temp2 = pd.DataFrame(???, columns=["x","y"]) # Put the cluster centers in here
    df_temp2["num_steps"] = i
    center_list.append(???)

df_centers = pd.concat(center_list, axis=???)

c2 = alt.Chart(df_centers).mark_point(size=300, filled=True, opacity=1).encode(
    x = "x",
    y = "y",
    color = alt.value("black"),
    shape = alt.value("cross"),
).transform_filter(
    ???
).properties(
    width=400,
    height=500
)

c1+c2

It would be better if the centers and the cluster colors moved in different stages.

  • Make a copy of df_clusters, called df_clusters2, and subtract 0.5 from the num_steps column of df_clusters2. (Hint. Be sure you use the copy method.)

  • Concatenate df_clusters and df_clusters2, one on top of the other, and name the result df_clusters_both.

  • Make a copy of df_centers, called df_centers2, and add 0.5 to the num_steps column of df_centers2. Again, be sure to use the copy method.

  • Concatenate df_centers and df_centers2, one on top of the other, and name the result df_centers_both.
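
The copy-shift-concatenate pattern in the steps above can be sketched on a tiny DataFrame. The names df_a, df_b, and both are hypothetical, and the shift of 0.5 is subtracted here only as an example. The point of the sketch is that using the copy method leaves the original DataFrame unchanged.

```python
import pandas as pd

df_a = pd.DataFrame({"x": [1.0, 2.0], "num_steps": [0, 1]})

# Work on an independent copy so that the shift does not alter df_a.
df_b = df_a.copy()
df_b["num_steps"] = df_b["num_steps"] - 0.5

# Stack the original and the shifted copy, one on top of the other.
both = pd.concat([df_a, df_b], axis=0)

print(df_a["num_steps"].tolist())
print(sorted(both["num_steps"].tolist()))
```

After concatenation, num_steps takes values on a half-step grid, which is what lets a slider with step 0.5 separate the two kinds of updates.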

Adapt the slider and chart code from above (both c1 and c2, starting at step_slider) and paste it below. Make the following changes.

  • Change the step for the slider from 1 to 0.5.

  • Change the DataFrames used in both charts, to df_clusters_both and df_centers_both.

  • Your code should look like the following. (You should not include the code making df_clusters or df_centers.)

step_slider = ...
step_select = ...
c1 = ...
c2 = ...
c1+c2

If everything went right, you should see the clusters and the centers move in different stages. Take a minute to look at the demonstration and make sure you understand what is happening.

Try changing some values above, until you get a chart that you think is interesting.

  • Try different values of true_cluster_num, guess_cluster_num, n_samples, max_steps. If your df_clusters_both has more than 5000 rows, you can use alt.data_transformers.enable('default', max_rows=???) to allow more rows for Altair. For this particular DataFrame, anything up to about 50,000 rows should be fine.

  • Try different values of the seed for the NumPy random number generator and of the random_state for the make_blobs function. You can also make the blobs more or less spread out, by including a cluster_std keyword argument in the make_blobs function. (The bigger the cluster_std, the more spread out the clusters will be.)

  • You can try changing the colors of the clusters, by setting a different color scheme. You can also try different shapes, sizes, and colors for the center point markings. I believe this is the list of possible plotting shapes.

  • In the following markdown cell, briefly explain what you like/think is interesting about the chart you chose. (Just 1-2 sentences is fine.)
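
The effect of cluster_std mentioned in the list above can be checked numerically with a small sketch. The fixed centers, the helper name mean_spread, and the two cluster_std values are hypothetical choices made for illustration; the worksheet itself passes an integer for centers.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Hypothetical fixed centers, so that the spread around each one can be measured.
centers = np.array([[-5.0, 0.0], [5.0, 0.0]])

def mean_spread(cluster_std):
    """Average distance from each generated point to the center it was drawn around."""
    X, y = make_blobs(
        n_samples=200,
        centers=centers,
        cluster_std=cluster_std,
        random_state=1,
    )
    return np.linalg.norm(X - centers[y], axis=1).mean()

tight = mean_spread(0.5)
wide = mean_spread(3.0)
print(tight, wide)
```

A larger cluster_std should give a larger average distance, which matches the blobs looking more spread out in the chart.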

Reminder

Every group member needs to submit this on Canvas (even if you all submit the same link).

Be sure you’ve included your full name and the full names of your group members at the top after “Authors”. You might consider restarting this notebook (the Deepnote refresh icon, not your browser refresh icon) and running it again to make sure it’s working.

Submission

Using the Share & publish menu at the top right, enable public sharing in the “Share project” section, and enable Comment privileges. Then submit that link on Canvas.
