Week 5, Thursday Discussion
Reminders:
Homework #4 due Tuesday 11:59pm
Quiz #4 Tuesday during last 20 minutes of discussion
Today:
Work on Homework #4
Try this worksheet, if done with homework already
Data cleaning
We already know a good way to import the Spotify dataset, by using na_values. In this portion of the worksheet, let’s see another way. This will give practice with slicing, with pd.to_numeric, with apply, and with lambda functions.
Import the Spotify dataset as df but do not provide the na_values keyword argument.
Define first_col = "Danceability" and last_col = "Liveness".
Get the sub-DataFrame of df consisting of the columns from first_col to last_col (including last_col). Get this sub-DataFrame using loc and a slice from first_col to last_col. (Hint 1. Make sure you’re slicing the columns axis, not the rows axis. Hint 2. Unlike most slicing in Python, if you slice using loc, the right endpoint is included.) Name the resulting DataFrame df_sub. Apply .copy() so that df_sub is a brand new DataFrame.
Check your answer. The shape of df_sub should be (1556, 6) and all the dtypes in df_sub should be “object”.
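As a rough illustration of the slicing described above, here is a minimal sketch on a made-up DataFrame. The name toy, its values, and the "missing" placeholder are assumptions for illustration only; they are not the Spotify data.

```python
import pandas as pd

# Hypothetical stand-in for the Spotify data; every column holds strings.
toy = pd.DataFrame(
    {
        "Song Name": ["A", "B", "C"],
        "Danceability": ["0.71", "0.59", "missing"],
        "Energy": ["0.80", "missing", "0.62"],
        "Liveness": ["0.09", "0.36", "0.11"],
        "Chord": ["B", "C#/Db", "A"],
    },
    dtype=object,
)

first_col = "Danceability"
last_col = "Liveness"

# The first slice (:) keeps every row; the second slice selects columns.
# With loc, the right endpoint last_col is included.
toy_sub = toy.loc[:, first_col:last_col].copy()

print(toy_sub.shape)
print(toy_sub.dtypes)
```

Because of .copy(), later changes to toy_sub do not affect toy.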
Get the third column (in the Python numbering) of df_sub using iloc and name the result s (for Series).
Check the type of s. Make sure it’s really a Series.
Try applying the pandas to_numeric function to s. What happens?
Try again, using the errors keyword argument (see the documentation) so that the non-numeric strings get converted to not-a-number. Name the result s2.
Check your answer: the type of s2 should be a Series and the dtype of s2 should be “float64”.
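The behaviour of the errors keyword argument can be sketched on a small made-up Series. The names s_toy and s2_toy and the "missing" placeholder are assumptions for illustration, not values from the Spotify dataset.

```python
import pandas as pd

# Hypothetical Series of strings with one non-numeric entry.
s_toy = pd.Series(["0.71", "missing", "0.62"], dtype=object)

# With the default errors="raise", a non-numeric string causes a ValueError.
try:
    pd.to_numeric(s_toy)
except ValueError as err:
    print("to_numeric failed:", err)

# With errors="coerce", strings that cannot be parsed become NaN instead.
s2_toy = pd.to_numeric(s_toy, errors="coerce")

print(type(s2_toy))
print(s2_toy.dtype)
```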
Try evaluating df_sub.apply(pd.to_numeric, axis=0). What goes wrong?
Using the same approach and a lambda function, try the same thing where you also coerce non-numeric strings into not-a-number. (In other words, use a mix of this apply strategy with the above errors= strategy.) Call the result df_sub2.
Check your answer. All the dtypes in df_sub2 should be “float64”.
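One way the apply-plus-lambda idea might look, sketched on a made-up DataFrame (toy_sub and toy_sub2 are hypothetical names, and the values are invented):

```python
import pandas as pd

# Hypothetical sub-DataFrame whose columns all contain strings.
toy_sub = pd.DataFrame(
    {
        "Danceability": ["0.71", "0.59", "missing"],
        "Energy": ["0.80", "missing", "0.62"],
        "Liveness": ["0.09", "0.36", "0.11"],
    },
    dtype=object,
)

# With axis=0, apply calls the lambda once per column, passing each column
# in as a Series. The lambda lets us supply the extra errors="coerce" argument.
toy_sub2 = toy_sub.apply(lambda col: pd.to_numeric(col, errors="coerce"), axis=0)

print(toy_sub2.dtypes)
```

The lambda is needed only because pd.to_numeric must be called with an extra keyword argument; apply by itself passes just the column.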
Once df_sub2 is correct, put the values from df_sub2 back into the original DataFrame df. (This should be relatively short. You shouldn’t have to use a for loop or anything.)
Check your answer. Try evaluating df.dtypes["Liveness"]. You should see that this column is “float64”. (Side question. What type of Python object is df.dtypes?)
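There is more than one way to write the converted values back. The sketch below shows one possibility on made-up data: assigning to a list of column names, which replaces those columns. The names toy and toy_sub2 are hypothetical, and this is not necessarily the approach the worksheet has in mind.

```python
import pandas as pd

# Hypothetical stand-in for df: string columns, one of them non-numeric text.
toy = pd.DataFrame(
    {
        "Song Name": ["A", "B", "C"],
        "Danceability": ["0.71", "0.59", "missing"],
        "Energy": ["0.80", "missing", "0.62"],
        "Liveness": ["0.09", "0.36", "0.11"],
    },
    dtype=object,
)
first_col = "Danceability"
last_col = "Liveness"

# Numeric version of the selected columns.
toy_sub2 = toy.loc[:, first_col:last_col].apply(
    lambda col: pd.to_numeric(col, errors="coerce")
)

# Assigning to a list of column names replaces those columns one by one,
# so they take on the float64 dtype of toy_sub2.
toy[list(toy_sub2.columns)] = toy_sub2

print(toy.dtypes["Liveness"])
```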
Drop the rows that contain missing values from df.
Check your answer: the resulting df should be 1545 rows by 23 columns.
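A small sketch of dropping rows with missing values, on made-up data (toy_num and cleaned are hypothetical names):

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame in which two of the four rows have a missing value.
toy_num = pd.DataFrame(
    {
        "Song Name": ["A", "B", "C", "D"],
        "Danceability": [0.71, np.nan, 0.62, 0.55],
        "Energy": [0.80, 0.50, np.nan, 0.43],
    }
)

# dropna removes every row that contains at least one missing value.
# It returns a new DataFrame, so the result has to be assigned to a name.
cleaned = toy_num.dropna()

print(cleaned.shape)
```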
Clustering
Divide the songs in the Spotify dataset into 5 distinct clusters by using a KMeans object from scikit-learn. Fit the KMeans object using the data from columns first_col:last_col.
Add a “cluster” column to df containing the results of this clustering operation.
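The clustering step could be sketched as follows on randomly generated data. The DataFrame toy, the choice of three columns, and the n_init and random_state values are all assumptions for illustration; the worksheet does not specify them.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical numeric data standing in for the cleaned Spotify columns.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.random((60, 3)), columns=["Danceability", "Energy", "Liveness"]
)
first_col = "Danceability"
last_col = "Liveness"

# Instantiate, then fit on the chosen columns. random_state is fixed here
# only so that this sketch gives the same clusters each time it is run.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
kmeans.fit(toy.loc[:, first_col:last_col])

# predict returns one cluster label (0 through 4) per row.
toy["cluster"] = kmeans.predict(toy.loc[:, first_col:last_col])

print(toy["cluster"].value_counts())
```

Without a fixed random_state, the cluster numbers can change from run to run, even if the groupings are similar.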
Plotting the results
Evaluate the following code. How much of the code can you understand? (Some of it is new, like the bind="legend" and the alt.condition.)
Try clicking on the legend. (For some reason, clicking on clusters 1-4 works fine for me, but clicking on cluster 0 only works for me if I click between the color and the number 0.)
Notice that the tooltip does not seem to work on the bar chart. That is a bug in Altair related to interactivity. One workaround for the bug is to add a meaningless selection object to the second chart. Get the tooltip to work by adding the code .add_selection(alt.selection_single()) right after the transform_filter(sel) part of the code (i.e., add this selection to the definition of the c2 chart).
Which cluster contains the most songs? Which cluster contains the fewest songs? (Warning: If you repeat this experiment, it will probably be a different cluster number which is the smallest.)
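import altair as alt

# Clicking a legend entry selects that cluster. (This is the Altair 4 API;
# Altair 5 renames selection_single/add_selection to selection_point/add_params.)
sel = alt.selection_single(fields=["cluster"], bind="legend")

# Scatter plot: points outside the selected cluster are faded.
c1 = alt.Chart(df).mark_circle().encode(
    x="Energy",
    y="Liveness",
    color="cluster:N",
    opacity=alt.condition(sel, alt.value(1), alt.value(0.1)),
    tooltip=["Song Name", "Artist"]
).add_selection(sel)

# Bar chart: number of songs in the selected cluster.
c2 = alt.Chart(df).mark_bar().encode(
    y=alt.Y("count()", scale=alt.Scale(domain=(0, 1600))),
    tooltip="count()"
).transform_filter(sel)

# Display the two charts side by side.
c1|c2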