Homework

Starter code

The following code performs K-Means clustering, with 5 clusters, like what we did in class, but with a few changes:

  • Instead of starting with a real dataset, we use make_blobs from sklearn to generate random data.

  • We specify explicit starting points.

  • We tell the algorithm to run only a single iteration; this is what max_iter = 1 does.

  • We also plot in black the 5 points used to assign the clusters. (So for example, look at the black point in the green blob. The points colored green are closer to that point than to any of the other four black points. The next step in the algorithm will be to move each black point to the center (the mean) of the points that share its color, and then to re-color all the points according to the new black points.)
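
The two steps described in the last bullet (color each point by its nearest center, then move each center) can be sketched in plain Python. This is only an illustration on made-up toy data, not the scikit-learn implementation; the function names assign_labels and update_centers are hypothetical, and the choice to leave an empty cluster's center where it is is an assumption that may differ from what KMeans does.

```python
import math

def assign_labels(points, centers):
    """Assignment step: label each point with the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
        for p in points
    ]

def update_centers(points, labels, centers):
    """Update step: move each center to the mean of the points assigned to it."""
    new_centers = []
    for k, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if not members:
            # Assumption for this sketch: an empty cluster keeps its old center.
            new_centers.append(old)
            continue
        new_centers.append(
            tuple(sum(p[d] for p in members) / len(members) for d in range(len(old)))
        )
    return new_centers

# Hypothetical toy data: four points and two starting centers.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
centers = [(1.0, 0.0), (3.0, 0.0)]

labels = assign_labels(points, centers)            # "color" the points
centers = update_centers(points, labels, centers)  # move the "black points"
print(labels, centers)
```

Repeating these two calls in a loop gives one more iteration each time, which is roughly what raising max_iter allows KMeans to do.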

import numpy as np
import pandas as pd
import altair as alt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
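# Generate 1000 random 2D points grouped into 5 blobs (random_state makes this reproducible)
X, _ = make_blobs(n_samples=1000, centers=5, n_features=2, random_state=1)
df = pd.DataFrame(X, columns=list("ab"))
# Explicit starting centers, one row per cluster
starting_points = np.array([[0,0],[-2,0],[-4,0],[0,2],[0,4]])
# max_iter=1 stops after a single iteration; n_init=1 runs from the starting points only once
kmeans = KMeans(n_clusters=5, max_iter=1, init=starting_points, n_init=1)
kmeans.fit(X)
# Label each point with the index of its nearest center
df["c"] = kmeans.predict(X)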
chart1 = alt.Chart(df).mark_circle().encode(
    x = "a",
    y = "b",
    color = "c:N"
)

df_centers = pd.DataFrame(kmeans.cluster_centers_, columns = list("ab"))

chart_centers = alt.Chart(df_centers).mark_point().encode(
    x = "a",
    y = "b",
    color = alt.value("black"),
    shape = alt.value("diamond"),
)

chart1 + chart_centers

Part 1 of the assignment

Turn the above into a Streamlit app and make the following changes:

  • Give the app a title related to K-Means.

  • Include a slider that lets the user choose the number of iterations.

Not to be turned in: are you able to predict how the colors change from one iteration to the next? It’s pretty difficult because there are so many points. You could try changing n_samples from 1000 to 100 above and see if it becomes easier.

Part 2 of the assignment

Deploy your app on Streamlit Cloud. You can follow the instructions from Week 4.

Part 3 of the assignment

Start thinking about the final project (more details here). In particular, think about what dataset you might want to use and where you might find it, what questions you might want to ask about the dataset, and what “extra” component (not something covered in Math 10) you might want to include.

Submission

The submission on Canvas should have a text entry box. Enter the following in it:

  1. Link to the shared Streamlit app. (The running app, not the source code.)

  2. What dataset(s) are you thinking of using in the project? (What is in the dataset, and where can you find it?)

  3. What questions are you considering related to this data?

  4. What “extra” component are you interested in trying to use?