Worksheet 9
This worksheet is due Tuesday of Week 7, before discussion section. You are encouraged to work in groups of up to 3 total students, but each student should make their own submission on Canvas. (It’s fine for everyone in the group to have the same upload.)
The goal of this worksheet is to apply the K-means algorithm on a real-world dataset. We will see what can happen if the scales of the features are very different.
Preparing the data
Import the attached Spotify dataset as df. Do not specify the na_values keyword argument yet.
Evaluate is_string_dtype from pandas.api.types on each column in df, using the code df.apply(???, axis=???). Notice that most of the columns contain strings. (There’s no need to use a lambda function in this case, because is_string_dtype is already a function.)
Define first_col = "Danceability" and last_col = "Liveness".
Try applying pd.to_numeric to the first_col column in df. The error indicates what the “bad” values are in this column.
Again import the Spotify dataset as df, but this time specify the na_values keyword argument, with the “bad” value that we just found.
Evaluate the dtypes attribute of df. Notice how many of the columns now contain numeric values.
Drop the rows that contain missing values, using the code df = df.dropna(axis=???).copy().
Check your answer. The DataFrame df should have 1545 rows and 23 columns.
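The cleaning steps above can be sketched on a small made-up table. This is only an illustration under stated assumptions: the CSV text, the column values, and the placeholder "unknown" are hypothetical stand-ins and are not taken from the Spotify file, whose actual “bad” value you still need to find from the error message.

```python
import io

import pandas as pd
from pandas.api.types import is_string_dtype

# Hypothetical toy CSV; "unknown" stands in for whatever "bad" value the real file uses.
csv_text = """Song,Danceability,Energy
A,0.5,0.7
B,unknown,0.2
C,0.9,unknown
D,0.3,0.4
"""

# First import: no na_values, so the placeholder forces both numeric columns to hold strings.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.apply(is_string_dtype))

# pd.to_numeric fails on the placeholder, and the error message names the offending string.
try:
    pd.to_numeric(raw["Danceability"])
except ValueError as err:
    print(err)

# Second import: treat the placeholder as missing, then drop incomplete rows.
clean = pd.read_csv(io.StringIO(csv_text), na_values="unknown")
clean = clean.dropna().copy()
print(clean.dtypes)
print(clean.shape)
```

The call to copy at the end makes clean an independent DataFrame, so adding a new column to it later should not trigger a warning about modifying a slice.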
Clustering
Instantiate a new KMeans object from scikit-learn’s cluster module, specifying that the KMeans object should produce 5 clusters. Name the resulting object kmeans. (Notice the use of upper-case vs lower-case.)
When you instantiate the object, specify the random_state keyword argument with your student id number. (If you are in a group, just choose one of your student id numbers.)
Fit this KMeans object using the data from columns first_col to last_col. Use df.loc, and remember that, unlike most slicing in Python, if you slice using loc, the last column is also included.
Using the predict method of the KMeans object, get a corresponding cluster number for each row in df.
Make a new column named “cluster” in df containing these cluster numbers. (This step will raise a warning if you forgot to use copy above in the dropna step.)
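As a rough sketch of the fit and predict workflow, here is the same sequence of steps on a tiny made-up DataFrame. The column names, the choice of 2 clusters, and random_state=0 are all hypothetical; the worksheet itself asks for 5 clusters and your student id.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy data with two well-separated groups in the "x" and "y" columns.
toy = pd.DataFrame({
    "name": list("abcdef"),
    "x": [0.0, 0.1, 0.2, 5.0, 5.1, 5.2],
    "y": [1.0, 1.1, 0.9, 8.0, 8.2, 7.9],
    "z": [3, 1, 4, 1, 5, 9],
})

# Label-based slicing with loc includes both endpoints, so "y" is part of the result.
features = toy.loc[:, "x":"y"]
print(list(features.columns))

# n_init is set explicitly only to keep behavior the same across scikit-learn versions.
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
kmeans.fit(features)

# predict returns one cluster number per row; store it as a new column.
toy["cluster"] = kmeans.predict(features)
print(toy)
```

The cluster numbers themselves are arbitrary labels: which group is called 0 and which is called 1 depends on the random state.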
Plotting the results
Define colX1 = "Acousticness" and colX2 = "Loudness".
Choose two other distinct columns from df.loc[:, first_col:last_col].columns; call them colY, colC.
Check that your column names really are different, by checking that the set {colX1, colX2, colY, colC} has length 4.
Make an Altair facet chart of scatter plots, using the “cluster” column for the faceting, using colX1 for the x-axis, using colY for the y-axis, and using colC for the color.
Include a tooltip that will display the “Artist” and “Song Name” for each point.
Choose an interesting color scheme for the color scale.
Have the individual charts appear in different rows, unlike the default, in which they appear in different columns.
For this row encoding, specify an encoding data type of Q, N, O, or T. Only one of these makes sense for a cluster number; which one?
Add a title to the chart, of the form "Student id = ???", where ??? gets replaced by the Student id you used for the random state.
Which cluster appears to have the fewest points in it? Verify your answer using value_counts. Try to write code using value_counts that produces this cluster number.
Which cluster appears to have the fewest points in it? Verify your answer using groupby and count. Try to write code that outputs this cluster number.
Make the exact same Altair chart as above, but switch from colX1 to colX2 for the x-axis field. The result should look very different.
Not to be turned in, but important for the quizzes and next midterm. Why does the facet chart with colX2 look so different from the facet chart with colX1? Hint: Compute the range of each column from first_col to last_col using the following:
df_temp = df.loc[:, first_col:last_col]
df_temp.max(axis=???) - df_temp.min(axis=???)
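To see how the range of a column can interact with K-means, here is a minimal sketch on randomly generated data. Everything in it is an assumption made for illustration: the column names "small" and "large", the value ranges, the 3 clusters, and the seed are not taken from the Spotify dataset.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical data: one feature lives in [0, 1], the other spans about 60 units.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "small": rng.uniform(0, 1, size=200),
    "large": rng.uniform(-60, 0, size=200),
})

# Range (max minus min) of each column.
ranges = toy.max() - toy.min()
print(ranges)

# Cluster on both features without any rescaling.
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10).fit(toy)
toy["cluster"] = kmeans.predict(toy[["small", "large"]])

# Range of each feature within each cluster.
spread = toy.groupby("cluster")[["small", "large"]].agg(lambda s: s.max() - s.min())
print(spread)
```

In this toy setting, comparing spread with ranges suggests which feature the clusters are mostly organized around. K-means measures Euclidean distance, so a column with a much larger range contributes much more to those distances.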
Reminder
Every group member needs to submit this on Canvas (even if you all submit the same file).
Submission
Save the chart you just produced (the one using colX2 for the x-axis) as a json file using the following code, and upload that json file on Canvas. (You will need to first assign the chart to a variable name.)
with open("chartX2.json", "w") as f:
    f.write(???.to_json())