Worksheet 9
This worksheet is due Tuesday of Week 7, before discussion section. You are encouraged to work in groups of up to 3 total students, but each student should make their own submission on Canvas. (It’s fine for everyone in the group to have the same upload.)
The goal of this worksheet is to apply the K-means algorithm to a real-world dataset. We will see what can happen if the scales of the features are very different.
Preparing the data#
Import the attached Spotify dataset as df. Do not specify the na_values keyword argument yet.
Evaluate is_string_dtype from pandas.api.types on each column in df, using the code df.apply(???, axis=???). Notice that most of the columns contain strings. (There’s no need to use a lambda function in this case, because is_string_dtype is already a function.)
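As a rough illustration of passing a function directly to apply, here is a minimal sketch on a made-up two-column DataFrame. The name toy and its columns are hypothetical and are not part of the Spotify dataset; the sketch relies on the default direction of apply rather than spelling out the axis argument.

```python
import pandas as pd
from pandas.api.types import is_string_dtype

# Hypothetical toy data, not the Spotify dataset.
toy = pd.DataFrame({
    "name": ["alpha", "beta", "gamma"],
    "score": [0.1, 0.5, 0.9],
})

# apply calls is_string_dtype once per column (its default direction),
# passing each whole column as a Series. No lambda is needed because
# is_string_dtype is already a function of one argument.
string_flags = toy.apply(is_string_dtype)
print(string_flags)
```

The result is a Series indexed by column name, with one Boolean per column.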
Define first_col = "Danceability" and last_col = "Liveness".
Try applying pd.to_numeric to the first_col column in df. The error indicates what the “bad” values are in this column.
Again import the Spotify dataset as df, but this time specify the na_values keyword argument, with the “bad” value that we just found.
Evaluate the dtypes attribute of df. Notice how many of the columns now contain numeric values.
Drop the rows which contain missing values, using the code df = df.dropna(axis=???).copy().
Check your answer. The DataFrame df should have 1545 rows and 23 columns.
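The data-preparation pattern in this section can be sketched on a tiny in-memory CSV. Everything here is an assumption made for illustration: the column names are invented, and "?" is a made-up placeholder for a missing value, not necessarily the one used in the Spotify file.

```python
import io
import pandas as pd

# Hypothetical CSV text; "?" is an invented stand-in for a missing value.
csv_text = "title,energy,tempo\nA,0.8,120\nB,?,95\nC,0.4,?\n"

# First import: the placeholder forces the affected columns to hold strings.
raw = pd.read_csv(io.StringIO(csv_text))
try:
    pd.to_numeric(raw["energy"])
    error_message = None
except ValueError as err:
    # The message reports the string that could not be parsed.
    error_message = str(err)

# Second import: declare the placeholder as missing, so the columns are numeric.
clean = pd.read_csv(io.StringIO(csv_text), na_values="?")
print(clean.dtypes)

# Drop rows with any missing value (the default behaviour of dropna);
# copy() gives an independent DataFrame that is safe to modify later.
complete = clean.dropna().copy()
print(complete.shape)
```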
Clustering#
Instantiate a new KMeans object from scikit-learn’s cluster module, specifying that the KMeans object should produce 5 clusters. Name the resulting object kmeans. (Notice the use of upper-case vs lower-case.)
When you instantiate the object, specify the random_state keyword argument with your student id number. (If you are in a group, just choose one of your student id numbers.)
Fit this KMeans object using the data from columns first_col to last_col. Use df.loc, and remember that, unlike most slicing in Python, if you slice using loc, the last column is also included.
Using the predict method of the KMeans object, get a corresponding cluster number for each row in df.
Make a new column named “cluster” in df containing these cluster numbers. (This step will raise a warning if you forgot to use copy above in the dropna step.)
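The fit, predict, and assign steps can be sketched on toy data as follows. The DataFrame toy, its column names, the choice of 2 clusters, and random_state=0 are all assumptions made for this illustration; n_init=10 is passed explicitly only to keep the behaviour the same across scikit-learn versions.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy data with two well-separated groups of rows.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "a": [0.0, 0.1, 0.2, 5.0, 5.1, 5.2],
    "b": [1.0, 1.1, 0.9, 6.0, 6.1, 5.9],
    "c": [0.5, 0.4, 0.6, 4.5, 4.4, 4.6],
    "note": list("uvwxyz"),
})

# Label-based slicing with loc includes the end label "c".
features = toy.loc[:, "a":"c"]

# 2 clusters and random_state=0 are illustrative choices only.
model = KMeans(n_clusters=2, random_state=0, n_init=10)
model.fit(features)

# predict returns one cluster number per row of the input.
toy["cluster"] = model.predict(features)
print(toy[["id", "cluster"]])
```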
Plotting the results#
Define colX1 = "Acousticness" and colX2 = "Loudness".
Choose two other distinct columns from df.loc[:, first_col:last_col].columns; call them colY, colC.
Check that your column names really are different, by checking that the set {colX1, colX2, colY, colC} has length 4.
Make an Altair facet chart of scatter plots, using the “cluster” column for the faceting, using colX1 for the x-axis, using colY for the y-axis, and using colC for the color.
Include a tooltip that will display the “Artist” and “Song Name” for each point.
Choose an interesting color scheme for the color scale.
Have the individual charts appear in different rows, unlike the default, in which they appear in different columns.
For this row encoding, specify an encoding data type of Q, N, O, or T. Only one of these makes sense for a cluster number; which one?
Add a title to the chart, of the form "Student id = ???", where ??? gets replaced by the Student id you used for the random state.
Which cluster appears to have the fewest points in it? Verify your answer using value_counts. Try to write code using value_counts that produces this cluster number.
Which cluster appears to have the fewest points in it? Verify your answer using groupby and count. Try to write code that outputs this cluster number.
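One possible way to read off the least frequent value from either kind of count is idxmin, sketched here on invented data. The DataFrame toy and its columns label and value are hypothetical; this is only one of several approaches.

```python
import pandas as pd

# Hypothetical toy data: label 0 occurs once, 1 three times, 2 four times.
toy = pd.DataFrame({
    "label": [0, 1, 1, 2, 2, 2, 1, 2],
    "value": [3.1, 4.0, 2.2, 5.5, 1.9, 0.7, 2.8, 3.3],
})

# value_counts gives a Series: index = distinct labels, values = frequencies.
counts_a = toy["label"].value_counts()
smallest_a = counts_a.idxmin()   # index entry with the smallest frequency

# groupby + count tallies the non-missing entries of "value" per label.
counts_b = toy.groupby("label")["value"].count()
smallest_b = counts_b.idxmin()

print(smallest_a, smallest_b)
```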
Make the exact same Altair chart as above, but switch from colX1 to colX2 for the x-axis field. The result should look very different.
Not to be turned in, but important for the quizzes and next midterm. Why does the facet chart with colX2 look so different from the facet chart with colX1? Hint. Compute the range of each column from first_col to last_col using the following:
df_temp = df.loc[:, first_col:last_col]
df_temp.max(axis=???) - df_temp.min(axis=???)
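To see why column ranges can matter for a distance-based method such as K-means, which compares points by Euclidean distance, here is a small sketch on invented data. The columns small and large are hypothetical, and the sketch relies on the default direction of max and min rather than spelling out the axis argument.

```python
import pandas as pd

# Hypothetical toy data: one column spans 0 to 1, the other spans -60 to 0.
toy = pd.DataFrame({
    "small": [0.0, 0.5, 1.0],
    "large": [-60.0, -30.0, 0.0],
})

# Range (max minus min) of each column.
ranges = toy.max() - toy.min()
print(ranges)

# Squared Euclidean distance between the first and last rows,
# split into the contribution from each column.
diff = toy.iloc[0] - toy.iloc[2]
squared = diff ** 2
share_large = squared["large"] / squared.sum()
print(share_large)
```

In this toy case almost all of the squared distance comes from the wide-ranging column, so the narrow column has very little influence on which points count as close.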
Reminder#
Every group member needs to submit this on Canvas (even if you all submit the same file).
Submission#
Save the chart you just produced (the one using colX2 for the x-axis) as a json file using the following code, and upload that json file on Canvas. (You will need to first assign the chart to a variable name.)
with open("chartX2.json", "w") as f:
    f.write(???.to_json())