Worksheet 9

This worksheet is due Tuesday of Week 7, before discussion section. You are encouraged to work in groups of up to 3 total students, but each student should make their own submission on Canvas. (It’s fine for everyone in the group to have the same upload.)

The goal of this worksheet is to apply the K-means algorithm to a real-world dataset. We will see what can happen when the scales of the features are very different.

Preparing the data

  • Import the attached Spotify dataset as df. Do not specify the na_values keyword argument yet.

  • Evaluate is_string_dtype from pandas.api.types on each column in df, using the code df.apply(???, axis=???). Notice that most of the columns contain strings. (There’s no need to use a lambda function in this case, because is_string_dtype is already a function.)

  • Define first_col = "Danceability" and last_col = "Liveness".

  • Try applying pd.to_numeric to the first_col column in df. The error indicates what the “bad” values are in this column.

  • Again import the Spotify dataset as df, but this time specify the na_values keyword argument, with the “bad” value that we just found.

  • Evaluate the dtypes attribute of df. Notice how many of the columns now contain numeric values.

  • Drop the rows which contain missing values, using the code df = df.dropna(axis=???).copy().

  • Check your answer. The DataFrame df should have 1545 rows and 23 columns.
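The preparation steps above can be sketched on a small in-memory table. This is only an illustration under stated assumptions: the CSV text, its column values, and the marker "unknown" are invented stand-ins, not the contents of the attached Spotify file, and the bad value in the real data may well be different.

```python
import io

import pandas as pd
from pandas.api.types import is_string_dtype

# Invented stand-in for the real CSV; "unknown" is a hypothetical bad value.
csv_text = """Artist,Popularity,Danceability,Energy
A,80,0.71,0.80
B,65,unknown,0.55
C,72,0.42,unknown
D,90,0.63,0.91
"""

# First import: no na_values, so a column containing "unknown" is read as strings.
raw = pd.read_csv(io.StringIO(csv_text))
string_cols = raw.apply(is_string_dtype)  # one True/False entry per column

# pd.to_numeric fails on such a column; the message names the offending value.
error_message = None
try:
    pd.to_numeric(raw["Danceability"])
except (ValueError, TypeError) as err:
    error_message = str(err)

# Second import: declare the marker as missing, then drop incomplete rows.
clean = pd.read_csv(io.StringIO(csv_text), na_values="unknown")
clean = clean.dropna().copy()

print(string_cols)
print(error_message)
print(clean.dtypes)
print(clean.shape)
```

In this toy table, two of the four rows contain the marker, so two rows should remain after dropna.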

Clustering

  • Instantiate a new KMeans object from scikit-learn’s cluster module, specifying that the KMeans object should produce 5 clusters. Name the resulting object kmeans. (Notice the use of upper-case vs lower-case.)

  • When you instantiate the object, specify the random_state keyword argument with your student id number. (If you are in a group, just choose one of your student id numbers.)

  • Fit this KMeans object using the data from columns first_col to last_col. Use df.loc, and remember that, unlike most slicing in Python, if you slice using loc, the last column is also included.

  • Using the predict method of the KMeans object, get a corresponding cluster number for each row in df.

  • Make a new column named “cluster” in df containing these cluster numbers. (This step will raise a warning if you forgot to use copy above in the dropna step.)
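As a rough sketch of the clustering steps, the following uses randomly generated toy data rather than the Spotify columns. The column names, the choice of 3 clusters, the seed 12345, and the explicit n_init=10 are all assumptions made for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Invented toy data; the column names are hypothetical.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "name": [f"song{i}" for i in range(60)],
    "a": rng.uniform(0, 1, 60),
    "b": rng.uniform(0, 1, 60),
    "c": rng.uniform(0, 1, 60),
    "extra": rng.uniform(0, 1, 60),
})

# Label-based slicing with loc includes the end column "c".
features = toy.loc[:, "a":"c"]

# KMeans (upper-case K and M) is the class; kmeans (lower-case) is our object.
kmeans = KMeans(n_clusters=3, random_state=12345, n_init=10)
kmeans.fit(features)

# One cluster number per row, stored as a new column.
toy["cluster"] = kmeans.predict(features)

print(list(features.columns))
print(toy["cluster"].value_counts())
```

Because toy was built directly rather than sliced from another DataFrame, assigning the new column here does not raise the copy warning mentioned above.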

Plotting the results

  • Define colX1 = "Acousticness" and colX2 = "Loudness".

  • Choose two other distinct columns from df.loc[:, first_col:last_col].columns; call them colY, colC.

  • Check that your column names really are different, by checking that the set {colX1, colX2, colY, colC} has length 4.

  • Make an Altair facet chart of scatter plots, using the “cluster” column for the faceting, using colX1 for the x-axis, using colY for the y-axis, and using colC for the color.

  • Include a tooltip that will display the “Artist” and “Song Name” for each point.

  • Choose an interesting color scheme for the color scale.

  • Have the individual charts appear in different rows, unlike the default, in which they appear in different columns.

  • For this row encoding, specify an encoding data type of Q, N, O, or T. Only one of these makes sense for a cluster number; which one?

  • Add a title to the chart, of the form "Student id = ???", where ??? gets replaced by the Student id you used for the random state.

  • Which cluster appears to have the fewest points in it? Verify your answer using value_counts. Try to write code using value_counts that produces this cluster number.

  • Verify the same answer a second way, this time using groupby and count. Try to write code that outputs this cluster number.

  • Make the exact same Altair chart as above, but switch from colX1 to colX2 for the x-axis field. The result should look very different.

  • Not to be turned in, but important for the quizzes and the next midterm: why does the facet chart with colX2 look so different from the facet chart with colX1? Hint: compute the range of each column from first_col to last_col using the following:

df_temp = df.loc[:, first_col:last_col]
df_temp.max(axis=???) - df_temp.min(axis=???)
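The counting and range computations from this section can be tried on a toy DataFrame first. The data and the column names "tempo" and "ratio" below are invented for illustration, and idxmin is just one possible way to extract the label of the smallest cluster.

```python
import pandas as pd

# Invented toy data; "tempo" and "ratio" are hypothetical feature names.
toy = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 2, 2, 2, 2],
    "tempo": [60, 90, 120, 150, 180, 75, 100, 130, 170],
    "ratio": [0.25, 0.5, 0.75, 0.5, 0.25, 0.75, 0.5, 0.25, 0.5],
})

# Cluster sizes two ways; idxmin returns the label with the smallest count.
by_value_counts = toy["cluster"].value_counts().idxmin()
by_groupby = toy.groupby("cluster")["tempo"].count().idxmin()

# Range (max minus min) of each feature column.
features = toy.loc[:, "tempo":"ratio"]
ranges = features.max() - features.min()

print(by_value_counts, by_groupby)
print(ranges)
```

In this toy example, tempo spans 120 units while ratio spans only 0.5, so Euclidean distances between rows are driven almost entirely by tempo.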

Reminder

Every group member needs to submit this on Canvas (even if you all submit the same file).

Submission

Save the chart you just produced (the one using colX2 for the x-axis) as a JSON file using the following code, and upload that JSON file on Canvas. (You will need to first assign the chart to a variable name.)

with open("chartX2.json", "w") as f:
    f.write(???.to_json())